Abstract

Several research teams have recently been working toward the development of practical general- 
purpose protocols for verifiable computation. These protocols enable a computationally weak verifier to 
offload computations to a powerful but untrusted prover, while providing the verifier with a guarantee 
that the prover performed the requested computations correctly. Despite substantial progress, existing 
implementations require further improvements before they become practical for most settings. The main 
bottleneck is typically the extra effort required by the prover to return an answer with a guarantee of 
correctness, compared to returning an answer with no guarantee.

We describe a refinement of a powerful interactive proof protocol originally due to Goldwasser,
Kalai, and Rothblum [21]. Cormode, Mitzenmacher, and Thaler [14] show how to implement the prover
in this protocol in time O(S(n) log S(n)), where S(n) is the size of an arithmetic circuit computing the
function of interest. Our refinements apply to circuits whose wiring pattern is sufficiently "regular"; for
these circuits, we bring the runtime of the prover down to O(S(n)). That is, our prover can evaluate the
circuit with a guarantee of correctness, while suffering only a constant-factor blowup in work compared
to evaluating the circuit without any guarantee.

We argue that our refinements capture a large class of circuits, and prove some theorems formalizing
this notion. We complement our theoretical results with experiments on problems such as matrix
multiplication and determining the number of distinct elements in a data stream. Experimentally, our
refinements yield a 200x speedup for the prover over the implementation of Cormode et al., and our
prover is less than 10x slower than a C++ program that simply evaluates the circuit. Along the way, we
describe a special-purpose protocol for matrix multiplication that is of interest in its own right.

Our final contribution is the design of an interactive proof protocol targeted at general data parallel
computation. Compared to prior work, this protocol can more efficiently verify complicated computations
as long as that computation is applied independently to many different pieces of data.

Our refinements substantially advance the goal of achieving a truly practical general-purpose
implementation of interactive proofs.

1 Introduction

Protocols for verifiable computation enable a computationally weak verifier V to offload computations to a 
powerful but untrusted prover P. These protocols aim to provide the verifier with a guarantee that the prover
performed the requested computations correctly, without requiring the verifier to perform the computations 
herself. 

Surprisingly powerful protocols for verifiable computation were discovered within the computer science 
theory community several decades ago, in the form of interactive proofs (IPs) and their brethren, interactive 
arguments (IAs) and probabilistically checkable proofs (PCPs). In these protocols, the prover P solves a
problem using her (possibly vast) computational resources, and tells V the answer. P and V then have a
conversation, i.e., they engage in a randomized protocol involving the exchange of one or more messages.
During this conversation, P's goal is to convince V that the answer is correct.

Results quantifying the power of IPs, IAs, and PCPs represent some of the most celebrated results in 
all of computational complexity theory, but until recently they were mainly theoretical curiosities, far too 
inefficient for actual deployment. In fact, these results have traditionally been used mainly for
negative purposes: showing that many problems are just as hard to approximate as they are to solve
exactly.

However, the surging popularity of cloud computing has brought renewed interest in positive appli- 
cations of protocols for verifiable computation. A typical motivating scenario is as follows. A business 
processes billions or trillions of transactions a day. The volume is sufficiently high that the business cannot 
or will not store and process the transactions on its own. Instead, it offloads the processing to a commercial 
cloud computing service. The offloading of any computation raises issues of trust: the business may be con- 
cerned about relatively benign events like dropped transactions, buggy algorithms, or uncorrected hardware 
faults, or the business may be more paranoid and fear that the cloud operator is deliberately deceptive or has 
been externally compromised. Either way, each time the business poses a query to the cloud, the business 
may demand that the cloud also provide a guarantee that the returned answer is correct. 

This is precisely what protocols for verifiable computation accomplish, with the cloud acting as the 
prover in the protocol, and the business acting as the verifier. In this paper, we describe a refinement of an
existing general-purpose protocol originally due to Goldwasser, Kalai, and Rothblum [14, 21]. When they
are applicable, our techniques achieve asymptotically optimal runtime for the prover, and we demonstrate
that they yield protocols that are significantly closer to practicality than that achieved by prior work.

We also make progress toward addressing another issue of existing interactive proof implementations: 
their applicability. The protocol of Goldwasser, Kalai, and Rothblum (henceforth the GKR protocol) ap- 
plies in principle to any problem computed by a small-depth arithmetic circuit, but this is not the case when 
more fine-grained considerations of prover and verifier efficiency are taken into account. In brief, existing 
implementations of interactive proof protocols for circuit evaluation all require that the circuit have a highly
regular wiring pattern [14, 40]. If this is not the case, then these implementations require the verifier to
perform an expensive (though data-independent) preprocessing phase to pull out information about the wiring
of the circuit, and they require a substantial factor blowup (logarithmic in the circuit size) in runtime for 
the prover relative to evaluating the circuit without a guarantee of correctness. Developing a protocol that 
avoids these pitfalls and applies to more general computations remains an important open question. 

Our approach is the following. We do not have a magic bullet for dealing with irregular wiring patterns; 
if we want to avoid an expensive pre-processing phase for the verifier and minimize the blowup in runtime 
for the prover, we do need to make an assumption about the structure of the circuit we are verifying.
Acknowledging this, we ask whether there is some general structure in real-world computations that we can
leverage for efficiency gains.

To this end, we design a protocol that is highly efficient for data parallel computation. By data paral- 
lel computation, we mean any setting in which one applies the same computation independently to many 
pieces of data. Many outsourced computations are data parallel, with Amazon Elastic MapReduce being
one prominent example of a cloud computing service targeted specifically at data parallel computations. 
Crucially, we do not want to make significant assumptions on the sub-computation that is being applied, and 
in particular we want to handle sub-computations computed by circuits with highly irregular wiring patterns. 

The verifier in our protocol still has to perform an offline phase to pull out information about the wiring 
of the circuit, but the cost of this phase is proportional to the size of a single instance of the sub-computation, 
avoiding any dependence on the number of pieces of data to which the sub-computation is applied. Similarly, 
the blowup in runtime suffered by the prover is the same as it would be if the prover had run the basic GKR 
protocol on a single instance of the sub-computation. 

Our final contribution is to describe a new protocol specific to matrix multiplication that is of interest in 
its own right. It avoids circuit evaluation entirely, and reduces the overhead of the prover (relative to running 
any unverifiable algorithm) to an additive low-order term. 

A major message of our results is that the more structure that exists in a computation, the more effi- 
ciently it can be verified both in theory and in practice, and that this structure exists in many real-world 
computations. 

1.1 Prior Work 

1.1.1 Work on Interactive Proofs 

Goldwasser, Kalai, and Rothblum described a powerful general-purpose interactive proof protocol in [21]. 
This protocol is framed in the context of circuit evaluation. Given a layered arithmetic circuit C of depth 
d, size S(n), and fan-in 2, the GKR protocol allows a prover to evaluate C with a guarantee of correctness 
in time poly(S(n)), while the verifier runs in time Õ(n + d log S(n)), where n is the length of the input and
the Õ notation hides polylogarithmic factors in n. Thus, for circuits of polynomial size and sublinear depth,
the verifier's runtime is quasilinear in the input length. This can be much smaller than the total size of the 
circuit, thereby saving the verifier substantial time relative to executing the computation locally. 

Cormode, Mitzenmacher, and Thaler showed how to bring the runtime of the prover in the GKR protocol
down from poly(S(n)) to O(S(n) log S(n)) [14]. They also built a full implementation of the protocol and
ran it on benchmark problems. These results demonstrated that the protocol does indeed save the verifier
significant time in practice (relative to evaluating the circuit locally); they also demonstrated surprising
scalability for the prover, although the prover's runtime remained a major bottleneck. With the implementation
of [14] as a baseline, Thaler, Roberts, Mitzenmacher, and Pfister described a parallel implementation of
the GKR protocol that achieved 40x-100x speedups for the prover and 100x speedups for the (already fast)
implementation of the verifier [38].



Vu, Setty, Blumberg, and Walfish [40] further refine and extend the implementation of Cormode,
Mitzenmacher, and Thaler. Their contributions can be summarized as follows. First, they combine the GKR
protocol (plus some refinements) with a compiler from a high-level programming language so that programmers
do not have to explicitly express computation in the form of arithmetic circuits, as was the case in the
implementation of [14]. This substantially extends the reach of the implementation, but it should be noted that
their approach generates circuits with irregular wiring patterns, and therefore requires an expensive offline
setup phase for the verifier (which can be amortized if many instances of a single computation are verified
in batch). Second, they build a hybrid system that statically evaluates whether it is better to use the GKR
protocol or a different, cryptography-based argument system called Zaatar (see Section 1.1.2), and runs the
more efficient of the two protocols in an automated fashion.

A growing line of work studies protocols for verifiable computation in the context of data streaming. 
In this context, the goal is not just to save the verifier time (compared to doing the computation without a 
prover), but also to save the verifier space. This is motivated by cloud computing settings where the client 
does not even have space to store a local copy of the input. The protocols developed in this line of work 
allow the client to make a single streaming pass over the input (which can occur, for example, while the
client is uploading data to the cloud), keeping only a very small summary of the data set. The interactive
version of this model was introduced and dubbed streaming interactive proofs by Cormode, Thaler, and
Yi [15], who observed that many protocols from the IP literature, including the GKR protocol, can be
made to work in this restrictive setting. The observations of [15] imply that all of our protocols also work
with streaming verifiers. Non-interactive variants of the streaming interactive proofs model have also been
studied in detail [12, 13, 23, 27].



1.1.2 Work on Argument Systems 

There has been a lot of work on the development of efficient interactive arguments, which are essentially 
interactive proofs that are secure only against dishonest provers that run in polynomial time. A substantial 
body of work in this area has focused on the development of protocols targeted at specific problems (e.g.,
[2, 5, 16]). Other works have focused on the development of general-purpose argument systems. Several papers
in this direction (e.g., [8, 10, 11, 18]) have used fully homomorphic encryption [20], which unfortunately
remains impractical despite substantial recent progress. Work in this category by Chung et al. [10] focuses
on streaming settings, and is therefore particularly relevant. 

Very recently, several research teams have been pursuing the development of general-purpose argument 
systems that might be suitable for practical use. Theoretical work by Ben-Sasson et al. [4] focuses on the 
development of short PCPs that might be suitable for practical use. Such PCPs can be compiled into efficient 
interactive arguments, though it is not yet clear how the PCP constructions of [4] will perform in practice.

As short PCPs are often a bottleneck in the development of efficient argument systems, other works 
have focused on avoiding their use [3, 6, 7, 19, 30]. Gennaro et al. [19], building on work of Groth [22] and
Lipmaa [28], develop a model of computation called Quadratic Span Programs (QSPs) that is amenable to
fast checking using cryptographic tools, but that also allows for fast reductions from a more practical model
of computation (namely, circuit satisfiability). Bitansky et al. [9] also consider argument systems that avoid
the use of short PCPs. Very recent work by Parno et al. [30] describes a near-practical general-purpose
implementation of an argument system based on [19].

Another line of implementation work focusing on general-purpose interactive argument systems is due
to Setty et al. [34-36]. This line of work begins with a base argument system due to Ishai et al. [25], and
substantially refines the theory to achieve an implementation that approaches practicality. Setty et al. call
their most recent implementation Zaatar [36] (which is also based on the work of Gennaro et al. [19]),
with earlier implementations referred to as Ginger [35] and Pepper [34]. Notably, while Ginger and Pepper
suffered from quadratic overhead for the prover, Zaatar achieves polylogarithmic overhead for the prover.

Two differences between the GKR-based approach and the interactive arguments of [25, 34-36] are
worth highlighting. The first is that the argument systems make use of cryptography (which is why soundness
only holds against polynomial time provers for these systems) while GKR does not. The second is
that Zaatar and Ginger can only save the verifier time in a batching model, while the GKR approach can
save the verifier time even when outsourcing a single computation (the argument system of Parno et al. [30]
also requires an expensive one-time setup phase, but this phase can be amortized over many batches). An
empirical comparison of the GKR-based approach and Zaatar performed by Vu et al. [40] finds the GKR 
approach to be more efficient than Zaatar and Ginger for programs with relatively simple control flow, while 
Zaatar and Ginger are appropriate for programs with more complicated control flow, largely because these 
programs cannot obviously be computed by succinct arithmetic circuits, which is the computational model 
used by GKR. 

1.2 Our Contributions 

Our primary contributions are three-fold. Our first contribution addresses one of the biggest remaining 
obstacles to achieving a truly practical implementation of the GKR protocol: the logarithmic factor overhead 
for the prover. That is, Cormode et al. show how to implement the prover in time O(S(n) log S(n)), where
S(n) is the size of the arithmetic circuit to which the GKR protocol is applied, down from the $\Omega(S(n)^3)$
time required for a naive implementation. The hidden constant in the Big-Oh notation is at least 3, and the
log S(n) factor translates to well over an order of magnitude, even for circuits with a few million gates.

We remove this logarithmic factor, bringing P's runtime down to O(S(n)) for a large class of circuits.
Informally, our results apply to any circuit whose wiring pattern is sufficiently "regular". We formalize the
class of circuits to which our results apply in Theorem 1.

We experimentally demonstrate the generality and effectiveness of Theorem 1 via two case studies.
Specifically, we apply an implementation of the protocol of Theorem 1 to a circuit computing matrix
multiplication (MATMULT), as well as to a circuit computing the number of distinct items in a data stream
(DISTINCT). Experimentally, our refinements yield a 200x-250x speedup for the prover over the
state-of-the-art implementation of Cormode et al. [14]. A serial implementation of our prover is 5x-10x slower than a
C++ program that simply evaluates the circuit sequentially, a slowdown that is tolerable in realistic outsourc- 
ing scenarios where cycles are plentiful for the prover. Moreover, a parallel implementation of our prover 
using a graphics processing unit (GPU) is roughly 30x faster than our serial implementation, and therefore 
takes less time than that required to evaluate the circuit in serial. 

Our second contribution is to specify a highly efficient protocol for verifiably outsourcing arbitrary 
data parallel computation. Compared to prior work, this protocol can more efficiently verify complicated 
computations, as long as that computation is applied independently to many different pieces of data. We 
formalize this protocol and its efficiency guarantees in Theorem 2.

Our third contribution is to describe a new protocol specific to matrix multiplication that we believe to be
of interest in its own right. This protocol is formalized in Theorem 3. Given any unverifiable algorithm for
n x n matrix multiplication that requires time T(n) using space s(n), Theorem 3 allows the prover to run in
time $T(n) + O(n^2)$ using space $s(n) + o(n^2)$. Note that Theorem 3 (which is specific to matrix multiplication)
is much less general than Theorem 1 (which applies to any circuit with a sufficiently regular wiring pattern).
However, Theorem 3 achieves optimal runtime and space usage for the prover up to leading constants,
assuming there is no $O(n^2)$ time algorithm for matrix multiplication. While these properties are also satisfied
by a classic protocol due to Freivalds [17], the protocol of Theorem 3 is significantly more amenable to
use as a primitive when verifying computations that repeatedly invoke matrix multiplication. For example,
using the protocol of Theorem 3 as a primitive, we give a natural protocol for computing the diameter of
an unweighted directed graph G. V's runtime in this protocol is $O(m \log n)$, where m is the number of
edges in G, P's runtime matches the best known unverifiable diameter algorithm up to a low-order additive
term [33, 42], and the total communication is just polylog(n). We know of no other protocol achieving this.
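
For readers unfamiliar with the classic check of Freivalds [17] mentioned above, the following is a minimal Python sketch of one round of it. It is not the protocol of Theorem 3. The modulus 2^61 - 1, the function name, and the list-of-lists matrix representation are assumptions made for illustration only. The verifier receives a claimed product C and tests it against a random vector using three matrix-vector products, so the check costs on the order of n^2 field operations rather than a full matrix multiplication.

```python
import random

# Assumed prime modulus for illustration; any sufficiently large prime works.
P = 2**61 - 1

def freivalds_check(A, B, C, p=P, rng=random):
    """One round of Freivalds' test: accept iff A*(B*x) == C*x (mod p) for a random x."""
    n = len(A)
    x = [rng.randrange(p) for _ in range(n)]
    # Three matrix-vector products, each costing O(n^2) field operations.
    Bx = [sum(B[i][j] * x[j] for j in range(n)) % p for i in range(n)]
    ABx = [sum(A[i][j] * Bx[j] for j in range(n)) % p for i in range(n)]
    Cx = [sum(C[i][j] * x[j] for j in range(n)) % p for i in range(n)]
    return ABx == Cx
```

If C equals A · B modulo p, the check always accepts. If it does not, then (A · B - C)x is a non-zero linear form in the random vector x, so the check accepts with probability at most 1/p.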



We complement Theorem 3 with experimental results demonstrating its extreme efficiency.



1.3 Roadmap 

Section 2 presents preliminaries. We give a high-level overview of the ideas underlying our main results in
Section 3. Section 4 gives a detailed overview of prior work, including the standard sum-check protocol as
well as the GKR protocol. Section 5 contains the details of our time-optimal protocol for circuit evaluation
as formalized in Theorem 1. Section 6 describes our experimental case studies of the protocol described
in Theorem 1. Section 7 describes our protocol for arbitrary data parallel computation. Section 8 describes
some additional optimizations that apply to specific important wiring patterns. In particular, this section
describes our special-purpose protocol for MATMULT that achieves optimal prover efficiency up to leading
constants. Section 9 concludes.

2 Preliminaries 

2.1 Definitions 

We begin by defining a valid interactive proof protocol for a function f.

Definition 1 Consider a prover P and verifier V who both observe an input x and wish to compute a function
$f : \{0,1\}^n \to \mathcal{R}$ for some set $\mathcal{R}$. After the input is observed, P and V exchange a sequence of messages.
Denote the output of V on input x, given prover P and V's random bits R, by out(V, x, R, P). V can output
$\perp$ if V is not convinced that P's claim is valid.

We say P is a valid prover with respect to V if for all inputs x, $\Pr_R[\mathrm{out}(V, x, R, P) = f(x)] = 1$. The
property that there is at least one valid prover P with respect to V is called completeness. We say V is a
valid verifier for f with soundness probability $\delta$ if there is at least one valid prover P with respect to V,
and for all provers P' and all inputs x, $\Pr[\mathrm{out}(V, x, R, P') \notin \{f(x), \perp\}] \le \delta$. We say a prover-verifier pair
(P, V) is a valid interactive proof protocol for f if V is a valid verifier for f with soundness probability 1/3,
and P is a valid prover with respect to V. If V and P exchange r messages in total, we say the protocol has
$\lceil r/2 \rceil$ rounds.

Informally, the completeness property guarantees that an honest prover will convince the verifier that 
the claimed answer is correct, while the soundness property ensures that a dishonest prover will be caught 
with high probability. An interactive argument is an interactive proof where the soundness property holds 
only against polynomial-time provers P'.

We remark that the constant 1/3 used for the soundness probability in Definition 1 is chosen for
consistency with the interactive proofs literature, where 1/3 is used by convention. In our actual implementation,
the soundness probability will always be less than $2^{-45}$.

2.1.1 Cost Model 

Whenever we work over a finite field F, we assume that a single field operation can be computed in a single 
machine operation. For example, when we say that the prover P in our interactive protocols requires time
O(S(n)), we mean that P must perform O(S(n)) additions and multiplications within the finite field over
which the protocol is defined. 



Input Representation. Following prior work [12, 14, 15], all of the protocols we consider can handle inputs
specified in a general data stream form. Each element of the stream is a tuple $(i, \delta)$, where $i \in [n]$ and $\delta$ is
an integer. The $\delta$ values may be negative, thereby modeling deletions. The data stream implicitly defines a
frequency vector a, where $a_i$ is the sum of all $\delta$ values associated with i in the stream. For simplicity, we
assume throughout the paper that the number of stream updates m is related to n by a constant factor, i.e.,
$m = \Theta(n)$.

When checking the evaluation of a circuit C, we consider the inputs to C to be the entries of the frequency 
vector a. We emphasize that in all of our protocols, V only needs to see the raw stream and not the aggregated 
frequency vector a (see Lemma 2 for details). Notice that we may interpret the frequency vector a as an
object other than a vector, such as a matrix or a string. For example, in MATMULT, the data stream defines 
two matrices to be multiplied. 

When we refer to a streaming verifier with space usage s(n), we mean that the verifier can make a single
pass over the stream of tuples defining the input, regardless of their ordering, while storing at most s(n) 
elements in the finite field over which the protocol is defined. 

2.1.2 Problem Definitions 

To focus our discussion in this paper, we give special attention to two problems also considered in prior 
work [14, 38].



1. In the MATMULT problem, the input consists of two n x n matrices $A, B \in \mathbb{Z}^{n \times n}$, and the goal is to
compute the matrix product $A \cdot B$.

2. In the DISTINCT problem, also denoted $F_0$, the input is a data stream consisting of m tuples $(i, \delta)$ from
a universe of size n. The stream defines a frequency vector a, and the goal is to compute $|\{i : a_i \neq 0\}|$,
the number of items with non-zero frequency.
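
To make the input representation and the DISTINCT problem concrete, the following Python sketch aggregates a stream of (i, δ) updates into the frequency vector a and counts its non-zero entries. This is only the plain, unverified reference computation, not a protocol; the function names are hypothetical and chosen for illustration.

```python
def frequency_vector(stream, n):
    """Aggregate (i, delta) updates, with i in {1, ..., n}, into the vector a."""
    a = [0] * n
    for i, delta in stream:
        a[i - 1] += delta  # delta may be negative, which models a deletion
    return a

def distinct(stream, n):
    """Number of items whose aggregated frequency is non-zero."""
    return sum(1 for value in frequency_vector(stream, n) if value != 0)

# Item 1 is inserted twice and later deleted twice, so it does not count.
example_stream = [(1, 2), (3, 1), (1, -2), (4, 5)]
print(frequency_vector(example_stream, 4))  # [0, 0, 1, 5]
print(distinct(example_stream, 4))          # 2
```

Note that the result depends only on the aggregated vector, so the order in which the updates arrive does not matter.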

2.1.3 Additional Notation 

Throughout, [n] will denote the set {1, ..., n}, while [[n]] will denote the set {0, ..., n - 1}.

Let F be a field, and $F^* = F \setminus \{0\}$ its multiplicative group. For any d-variate polynomial
$p(x_1, \ldots, x_d) : F^d \to F$, we use $\deg_i(p)$ to denote the degree of p in variable i. A d-variate polynomial p is said to be
multilinear if $\deg_i(p) \le 1$ for all $i \in [d]$. Given a function $V : \{0,1\}^d \to \{0,1\}$ whose domain is the
d-dimensional Boolean hypercube, the multilinear extension (MLE) of V over F, denoted $\tilde{V}$, is the unique
multilinear polynomial $F^d \to F$ that agrees with V on all Boolean-valued inputs. That is, $\tilde{V}$ is the unique
multilinear polynomial over F satisfying $\tilde{V}(x) = V(x)$ for all $x \in \{0,1\}^d$.
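
As an illustration of the definition, the following Python sketch evaluates a multilinear extension at an arbitrary point, given the table of values on the Boolean hypercube. The prime modulus 2^61 - 1, the function name, and the indexing convention (the first variable is the most significant bit of the table index) are assumptions made for this example. The sketch accepts arbitrary field values in the table, which includes the {0, 1}-valued case of the definition.

```python
# Assumed prime modulus for illustration.
P = 2**61 - 1

def mle_eval(table, r, p=P):
    """Evaluate at r the multilinear extension of the function given by `table`.

    `table` has 2^d entries; entry k holds the value at the Boolean point whose
    bits are the binary digits of k, first variable most significant.
    """
    assert len(table) == 1 << len(r)
    t = [v % p for v in table]
    for ri in r:
        half = len(t) // 2
        # Bind the current first variable to ri by interpolating linearly
        # between the half with that variable 0 and the half with it 1.
        t = [((1 - ri) * t[k] + ri * t[k + half]) % p for k in range(half)]
    return t[0]
```

On Boolean points the routine returns the table entry itself, and elsewhere it returns the value of the unique multilinear interpolant. Each variable halves the table, so the total work is linear in the table size.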

3 Overview of the Ideas 

We begin by describing the methodology underlying the GKR protocol before summarizing the ideas un- 
derlying our improved protocols. 

3.1 The GKR Protocol From 10,000 Feet

In the GKR protocol, P and V first agree on an arithmetic circuit C of fan-in 2 over a finite field F computing
the function of interest (C may have multiple outputs). Each gate of C performs an addition or multiplication 
over F. C is assumed to be in layered form, meaning that the circuit can be decomposed into layers, and 
wires only connect gates in adjacent layers. Suppose the circuit has depth d; we will number the layers from 
1 to d with layer d referring to the input layer, and layer 1 referring to the output layer. 



In the first message, P tells V the (claimed) output of the circuit. The protocol then works its way in
iterations towards the input layer, with one iteration devoted to each layer. The purpose of iteration i is to
reduce a claim about the values of the gates at layer i to a claim about the values of the gates at layer i + 1,
in the sense that it is safe for V to assume that the first claim is true as long as the second claim is true. This
reduction is accomplished by applying the standard sum-check protocol [29] to a certain polynomial.

More concretely, the GKR protocol starts with a claim about the values of the output gates of the circuit, 
but V cannot check this claim without evaluating the circuit herself, which is precisely what she wants to 
avoid. So the first iteration uses a sum-check protocol to reduce this claim about the outputs of the circuit to 
a claim about the gate values at layer 2 (more specifically, to a claim about an evaluation of the multilinear 
extension (MLE) of the gate values at layer 2). Once again, V cannot check this claim herself, so the second 
iteration uses another sum-check protocol to reduce the latter claim to a claim about the gate values at layer 
3, and so on. Eventually, V is left with a claim about the inputs to the circuit, and V can check this claim on 
her own. 

In summary, the GKR protocol uses a sum-check protocol at each level of the circuit to enable V to
go from verifying a randomly chosen evaluation of the MLE of the gate values at layer i to verifying a
(different) evaluation of the MLE of the gate values at layer i + 1. Importantly, apart from the input layer
and output layer, V does not ever see all of the gate values at a layer (in particular, P does not send these
values in full). Instead, V relies on P to do the hard work of actually evaluating the circuit, and uses the
power of the sum-check protocol as the main tool to force P to be consistent and truthful over the course of
the protocol.
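
Since the sum-check protocol is the main tool in each iteration, a generic textbook-style Python sketch of it may help. It is not the polynomial or the implementation used in GKR: the polynomial g is an arbitrary callable, the honest prover is deliberately naive (brute-force sums), and the modulus 2^61 - 1 and all function names are assumptions made for illustration. The verifier wants to check a claimed value for the sum of g over {0,1}^d, where g has degree at most deg in each variable.

```python
import itertools
import random

P = 2**61 - 1  # assumed prime modulus

def lagrange_eval(ys, x, p=P):
    """Value at x of the polynomial of degree < len(ys) taking value ys[k] at k."""
    total = 0
    for k, yk in enumerate(ys):
        num, den = 1, 1
        for m in range(len(ys)):
            if m != k:
                num = num * (x - m) % p
                den = den * (k - m) % p
        total = (total + yk * num * pow(den, -1, p)) % p
    return total

def honest_message(g, d, deg, prefix, p=P):
    """Naive prover: values at t = 0..deg of g summed over the remaining Boolean variables."""
    rest = d - len(prefix) - 1
    return [sum(g(prefix + [t] + list(b))
                for b in itertools.product((0, 1), repeat=rest)) % p
            for t in range(deg + 1)]

def sum_check(g, d, deg, claim, prover=honest_message, rng=random, p=P):
    """Verifier: accept iff every round is consistent and the final evaluation matches."""
    r = []
    for _ in range(d):
        ys = prover(g, d, deg, r)
        if (ys[0] + ys[1]) % p != claim % p:
            return False            # round polynomial inconsistent with current claim
        rj = rng.randrange(p)       # fresh random challenge for this variable
        claim = lagrange_eval(ys, rj, p)
        r.append(rj)
    return g(r) % p == claim % p    # final check at a single random point

def example_g(x):
    """Example polynomial x0^2 * x1 + x2, whose sum over {0,1}^3 is 6."""
    return (x[0] * x[0] * x[1] + x[2]) % P

print(sum_check(example_g, 3, 2, 6))  # True: correct claim, honest prover
print(sum_check(example_g, 3, 2, 7))  # False: rejected in the first round
```

In this sketch the verifier evaluates g itself in the last line. In the GKR setting described above, V cannot do that for intermediate layers, and the required evaluation instead becomes the claim that the next iteration reduces further.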

3.2 Achieving Optimal Prover Runtime for Regular Circuits 

In Theorem 1, we describe an interactive proof protocol for circuit evaluation that brings P's runtime down
to O(S(n)) for a large class of circuits, while maintaining the same verifier runtime as in prior
implementations of the GKR protocol. Informally, Theorem 1 applies to any circuit whose wiring pattern is sufficiently
"regular".

This protocol follows the same general outline as the GKR protocol, in that we proceed in iterations 
from the output layer of the circuit to the input layer, using a sum-check protocol at iteration i to reduce 
a claim about the gate values at layer i to a claim about the gate values at layer i + 1 . However, at each 
iteration i we apply the sum-check protocol to a carefully chosen polynomial that differs from the one used 
by GKR. In each round j of the sum-check protocol, our choice of polynomial allows P to reuse work from
prior rounds in order to compute the prescribed message for round j, allowing us to shave a log S(n) factor
from the runtime of P relative to the O(S(n) log S(n))-time implementation due to Cormode et al. [14].

Specifically, at iteration i, the GKR protocol uses a polynomial $f^{(i)}$ defined over $\log S_i + 2 \log S_{i+1}$
variables, where $S_i$ is the number of gates at layer i. The "truth table" of $f^{(i)}$ is sparse on the Boolean hypercube,
in the sense that $f^{(i)}(x)$ is non-zero for at most $S_i$ of the $S_i \cdot S_{i+1}^2$ inputs $x \in \{0,1\}^{\log S_i + 2 \log S_{i+1}}$. Cormode et
al. leverage this sparsity to bring the runtime of P in iteration i down to $O(S_i \log S_i)$ from a naive bound of
$\Omega(S_i \cdot S_{i+1}^2)$. Although the sparsity of the truth table of $f^{(i)}$ is crucial in achieving $O(S_i \log S_i)$ runtime, this
same sparsity prevents P from reusing work from prior iterations as we seek to do.

In contrast, we use a polynomial $g^{(i)}$ defined over only $\log S_i$ variables rather than $\log S_i + 2 \log S_{i+1}$
variables. Moreover, the truth table of $g^{(i)}$ is dense on the Boolean hypercube, in the sense that $g^{(i)}(x)$ may
be non-zero for all of the $S_i$ Boolean inputs $x \in \{0,1\}^{\log S_i}$. This density allows P to reuse work from prior
iterations in order to speed up her computation in round i of the sum-check protocol.

In more detail, in each round j of the sum-check protocol, the prover's prescribed message is defined
via a sum over a large number of terms, where the number of terms falls geometrically fast with the round
number j. Moreover, it can be shown that in each round j, each gate at layer i + 1 contributes to exactly one
term of this sum. Essentially, what we do is group the gates at layer i + 1 by the term of the sum to which
they contribute. Each such group can be treated as a single unit, ensuring that in any round of the sum-check
protocol, the amount of work P needs to do is proportional to the number of terms in the sum rather than
the number of gates $S_i$ at layer i.

We remark that a similar "reuse of work" technique was implicit in an analysis by Cormode, Thaler, 
and Yi [15, Appendix B] of an efficient protocol for a specific streaming problem known as the second 
frequency moment. This frequency moment protocol was the direct inspiration for our refinements, though 
we require additional insights to apply the reuse of work technique in the context of evaluating general 
arithmetic circuits. 
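
The flavor of this reuse of work can be sketched in the simplest dense setting, a sum-check over the square of a multilinear extension, which is the shape of the second frequency moment computation. The Python sketch below is an illustration under stated assumptions, not the polynomial $g^{(i)}$ or the implementation of this paper: the modulus 2^61 - 1, the function names, and the "table entries touched" work counter are all hypothetical. The prover keeps a table that is halved after every challenge, so the work per round shrinks geometrically.

```python
P = 2**61 - 1  # assumed prime modulus

def round_message(table, p=P):
    """Values at t = 0, 1, 2 of the sum over k of ((1 - t) * lo[k] + t * hi[k])^2."""
    half = len(table) // 2
    msg = []
    for t in (0, 1, 2):
        s = 0
        for k in range(half):
            v = ((1 - t) * table[k] + t * table[k + half]) % p
            s = (s + v * v) % p
        msg.append(s)
    return msg

def fold(table, r, p=P):
    """Bind the current first variable to the challenge r, halving the table."""
    half = len(table) // 2
    return [((1 - r) * table[k] + r * table[k + half]) % p for k in range(half)]

def prover_run(a, challenges, p=P):
    """Return every round message and the total number of table entries touched."""
    table = [v % p for v in a]
    messages, work = [], 0
    for r in challenges:
        work += len(table)               # this round costs time linear in the table
        messages.append(round_message(table, p))
        table = fold(table, r, p)        # reused by all later rounds
    return messages, work

messages, work = prover_run([1, 2, 3, 4], [5, 7])
print(messages)  # [[5, 25, 61], [121, 144, 169]]
print(work)      # 6
```

The first message satisfies 5 + 25 = 30, the sum of the squared entries, and the table sizes 4, 2 sum to 6, which is less than twice the input length. In general the sizes form a geometric series, so the total is linear in the input length rather than picking up a logarithmic factor from recomputing each round from scratch.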

It is worth clarifying why our methods do not yield savings when applied to the polynomial $f^{(i)}$ used in
the basic GKR protocol. The reason is that, since $f^{(i)}$ is defined over $\log S_i + 2 \log S_{i+1}$ variables instead of
just $\log S_i$ variables, the sum defining P's message in round j is over a much larger number of terms when
using $f^{(i)}$. It is still the case that each gate contributes to only one term of the sum, but until the number of
terms in the sum falls below $S_i$ (which does not happen until round $j = \log S_i + \log S_{i+1}$ of the sum-check
protocol), it is possible for each gate to contribute to a different term. Before this point, grouping gates by
the term of the sum to which they contribute is not useful, since each group can have size 1.

3.3 Verifying General Data Parallel Computations 

Theorem 1 only applies to circuits with regular wiring patterns, as do other existing implementations of
interactive proof protocols for circuit evaluation [14, 40]. For circuits with irregular wiring patterns, these
implementations require the verifier to perform an expensive preprocessing phase (requiring time propor- 
tional to the size of the circuit) to pull out information about the wiring of the circuit, and they require a 
substantial factor blowup (logarithmic in the circuit size) in runtime for the prover relative to evaluating the 
circuit without a guarantee of correctness. 

To address these bottlenecks, we do need to make an assumption about the structure of the circuit we are
verifying. Ideally our assumption will be satisfied by many real-world computations. To this end, Theorem 2
describes a protocol that is highly efficient for any data parallel computation, by which we mean any setting
in which one applies the same computation independently to many pieces of data. See Figure 2 in Section 7
for a schematic of a data parallel computation.

The idea behind Theorem 2 is as follows. Let C be a circuit of size S with an arbitrary wiring pattern,
and let C* be a "super-circuit" that applies C independently to B different inputs before possibly aggregating
the results in some fashion. If one naively applied the basic GKR protocol to the super-circuit C*, $\mathcal{V}$ might
have to perform a pre-processing phase that requires time proportional to the size of C*, which is $\Omega(B \cdot S)$.
Moreover, when applying the basic GKR protocol to C*, $\mathcal{P}$ would require time $\Theta(B \cdot S \cdot \log(B \cdot S))$.

In order to improve on this, the key observation is that although each sub-computation C can have 
a very complicated wiring pattern, the circuit is "maximally regular" between sub-computations, as the 
sub-computations do not interact at all. Therefore, each time the basic GKR protocol would apply the 
sum-check protocol to a polynomial derived from the wiring predicate of C* , we instead use a simpler 
polynomial derived only from the wiring predicate of C. This immediately brings the time required by $\mathcal{V}$ in
the pre-processing phase down to $O(S)$, which is proportional to the cost of executing a single instance of
the sub-computation. By using the reuse of work technique underlying Theorem 1, we are also able to bring
$\mathcal{P}$'s runtime down from $\Theta(B \cdot S \cdot \log(B \cdot S))$ to $\Theta(B \cdot S \cdot \log S)$, i.e., $\mathcal{P}$ requires a factor of $O(\log S)$ more
time to evaluate the circuit with a guarantee of correctness, compared to evaluating the circuit without such
a guarantee. This $O(\log S)$ factor overhead does not depend on the batch size B.

3.4 A Special-Purpose Protocol for MATMULT

We describe a special-purpose protocol for $n \times n$ MATMULT in Theorem 3. The idea behind this protocol is
as follows. The GKR protocol, as well as the protocols of Theorems 1 and 2, only make use of the multilinear
extension $\tilde{V}_i$ of the function $V_i$ mapping gate labels at layer i of the circuit to their values. In some cases,
there is something to be gained by using a higher-degree extension of $V_i$, and this is precisely what we
exploit here. 

In more detail, our special-purpose protocol can be viewed as an extension of our circuit-checking 
techniques applied to a circuit C performing naive matrix multiplication, but using a quadratic extension 
of the gate values in this circuit. This allows us to verify the computation using a single invocation of the 
sum-check protocol. More importantly, $\mathcal{P}$ can evaluate this higher-degree extension at the necessary points
without explicitly materializing all of the gate values of C, which would not be possible if we had used the 
multilinear extension of the gate values of C. 

In the protocol of Theorem 3, $\mathcal{P}$ just needs to compute the correct output (possibly using an algorithm
that is much more sophisticated than naive matrix multiplication), and then perform $O(n^2)$ additional work
to prove the output is correct. Since $\mathcal{P}$ does not have to evaluate C in full, this protocol is perhaps best viewed
outside the lens of circuit evaluation. Still, the idea underlying Theorem 3 can be thought of as a refinement
of our circuit evaluation protocols, and we believe that similar ideas may yield further improvements to 
general-purpose protocols in the future. 

4 Technical Background 

4.1 Schwartz-Zippel Lemma 

We will often make use of the following basic property of polynomials. 

Lemma 1 ([32]) Let $\mathbb{F}$ be any field, and let $f : \mathbb{F}^m \to \mathbb{F}$ be a nonzero polynomial of total degree d. Then on
any finite set $S \subseteq \mathbb{F}$,

$$\Pr_{x \leftarrow S^m}[f(x) = 0] \le d/|S|.$$

In particular, any two distinct polynomials of total degree d can agree on at most a $d/|S|$ fraction of points in $S^m$.

4.2 Sum-Check Protocol 

Our main technical tool is the sum-check protocol [29], and we present a full description of this protocol for 
completeness. See also [1, Chapter 8] for a complete exposition and proof of soundness.

Suppose we are given a v-variate polynomial g defined over a finite field $\mathbb{F}$. The purpose of the sum-check
protocol is to compute the sum:

$$H := \sum_{b_1 \in \{0,1\}} \sum_{b_2 \in \{0,1\}} \cdots \sum_{b_v \in \{0,1\}} g(b_1, \ldots, b_v).$$

In order to execute the protocol, the verifier needs to be able to evaluate $g(r_1, \ldots, r_v)$ for a randomly
chosen vector $(r_1, \ldots, r_v) \in \mathbb{F}^v$; see the paragraph preceding Proposition 1 below.

The protocol proceeds in v rounds as follows. In the first round, the prover sends a polynomial $g_1(X_1)$,
and claims that $g_1(X_1) = \sum_{(x_2, \ldots, x_v) \in \{0,1\}^{v-1}} g(X_1, x_2, \ldots, x_v)$. Observe that if $g_1$ is as claimed, then
$H = g_1(0) + g_1(1)$. Also observe that the polynomial $g_1(X_1)$ has degree $\deg_1(g)$, the degree of variable $x_1$ in g. Hence
$g_1$ can be specified with $\deg_1(g) + 1$ field elements. In our implementation, $\mathcal{P}$ will specify $g_1$ by sending the
evaluation of $g_1$ at each point in the set $\{0, 1, \ldots, \deg_1(g)\}$.

Then, in round $j > 1$, $\mathcal{V}$ chooses a value $r_{j-1}$ uniformly at random from $\mathbb{F}$ and sends $r_{j-1}$ to $\mathcal{P}$. We will
often refer to this step by saying that variable $j - 1$ gets bound to value $r_{j-1}$. In return, the prover sends a
polynomial $g_j(X_j)$, and claims that

$$g_j(X_j) = \sum_{(x_{j+1}, \ldots, x_v) \in \{0,1\}^{v-j}} g(r_1, \ldots, r_{j-1}, X_j, x_{j+1}, \ldots, x_v). \tag{1}$$

The verifier compares the two most recent polynomials by checking $g_{j-1}(r_{j-1}) = g_j(0) + g_j(1)$, and
rejecting otherwise. The verifier also rejects if the degree of $g_j$ is too high: each $g_j$ should have degree
$\deg_j(g)$, the degree of variable $x_j$ in g.

In the final round, the prover has sent $g_v(X_v)$, which is claimed to be $g(r_1, \ldots, r_{v-1}, X_v)$. $\mathcal{V}$ now checks
that $g_v(r_v) = g(r_1, \ldots, r_v)$ (recall that we assumed $\mathcal{V}$ can evaluate g at this point). If this test succeeds, and
so do all previous tests, then the verifier accepts, and is convinced that $H = g_1(0) + g_1(1)$.

Proposition 1 Let g be a v-variate polynomial defined over a finite field $\mathbb{F}$, and let $(\mathcal{P}, \mathcal{V})$ be the
prover-verifier pair in the above description of the sum-check protocol. $(\mathcal{P}, \mathcal{V})$ is a valid interactive proof protocol
for the function $H = \sum_{b_1 \in \{0,1\}} \sum_{b_2 \in \{0,1\}} \cdots \sum_{b_v \in \{0,1\}} g(b_1, \ldots, b_v)$.
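The round structure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the field modulus `Q`, the function names (`interpolate`, `honest_prover`, `sum_check`), the example polynomial `g`, and the use of a single degree bound `deg` for every variable are all hypothetical choices made here, not part of the paper's implementation. The prover is the naive one that sums over all remaining Boolean assignments.

```python
import itertools
import random

Q = 2**61 - 1  # field modulus; a Mersenne prime chosen here purely for illustration


def interpolate(ys, x, q=Q):
    """Evaluate at x the polynomial of degree < len(ys) that takes value ys[t] at t = 0, 1, ..."""
    total = 0
    for i, yi in enumerate(ys):
        num, den = 1, 1
        for k in range(len(ys)):
            if k != i:
                num = num * (x - k) % q
                den = den * (i - k) % q
        total = (total + yi * num * pow(den, -1, q)) % q
    return total


def honest_prover(g, v, deg, rs, q=Q):
    """Round j = len(rs) + 1: return g_j(0), ..., g_j(deg) by brute-force summation."""
    j = len(rs) + 1
    evals = []
    for t in range(deg + 1):
        s = 0
        for tail in itertools.product((0, 1), repeat=v - j):
            s = (s + g(*rs, t, *tail)) % q
        evals.append(s)
    return evals


def sum_check(g, v, deg, claimed_sum, prover=honest_prover, rng=random, q=Q):
    """Return True iff the verifier accepts the claim that g sums to claimed_sum over {0,1}^v."""
    rs = []
    expected = claimed_sum % q            # the value that g_j(0) + g_j(1) must equal
    for _ in range(v):
        evals = prover(g, v, deg, rs, q)
        if len(evals) != deg + 1:         # the message must describe a degree <= deg polynomial
            return False
        if (evals[0] + evals[1]) % q != expected:
            return False
        r = rng.randrange(q)              # bind the current variable to a random field element
        expected = interpolate(evals, r, q)
        rs.append(r)
    return g(*rs) % q == expected         # final check: a single evaluation of g


def g(x1, x2, x3):
    """Example polynomial of degree at most 2 in each variable."""
    return (x1 * x2 * x3 + 2 * x1 + x2 * x2) % Q
```

For this example `g`, the true sum over $\{0,1\}^3$ is 13, so `sum_check(g, 3, 2, 13)` accepts, while a wrong claim such as 14 is rejected in the first round because the honest message no longer matches the claim.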

4.2.1 Discussion of costs. 

Observe that there is one round in the sum-check protocol for each of the v variables of g. The total
communication is $\sum_{i=1}^{v} (\deg_i(g) + 1) = v + \sum_{i=1}^{v} \deg_i(g)$ field elements. In all of our applications, $\deg_i(g) = O(1)$
for all i, and so the communication cost is $O(v)$ field elements.

The running time of the verifier over the entire execution of the protocol is proportional to the total 
communication, plus the amount of time required to compute $g(r_1, \ldots, r_v)$.

Determining the running time of the prover is less straightforward. Recall that $\mathcal{P}$ can specify $g_j$ by
sending for each $i \in \{0, \ldots, \deg_j(g)\}$ the value:

$$g_j(i) = \sum_{(x_{j+1}, \ldots, x_v) \in \{0,1\}^{v-j}} g(r_1, \ldots, r_{j-1}, i, x_{j+1}, \ldots, x_v). \tag{2}$$

An important insight is that the number of terms defining the value $g_j(i)$ in Equation (2) falls
geometrically with j: in the jth sum, there are only $2^{v-j}$ terms, each corresponding to a Boolean vector
in $\{0,1\}^{v-j}$. Thus, the total number of terms that must be evaluated over the course of the protocol is
$\sum_{j=1}^{v} \deg_j(g) \, 2^{v-j} = O(2^v)$. Consequently, if $\mathcal{P}$ is given oracle access to the truth table of the polynomial g,
then $\mathcal{P}$ will require just $O(2^v)$ time.

Unfortunately, in our applications $\mathcal{P}$ will not have oracle access to the truth table of g. The key to our
results is to show that in our applications $\mathcal{P}$ can nonetheless evaluate g at all of the necessary points in $O(2^v)$
total time.

4.3 The GKR Protocol 

We describe the details of the GKR protocol for completeness, as well as to simplify the exposition of our 
refinements. 

4.3.1 Notation

Suppose we are given a layered arithmetic circuit C of size S(n), depth d(n), and fan-in two. Let $S_i$ denote
the number of gates at layer i of the circuit C. Assume $S_i$ is a power of 2 and let $S_i = 2^{s_i}$. In order to explain
how each iteration of the GKR protocol proceeds, we need to introduce several functions, each of which 
encodes certain information about the circuit. 

To this end, number the gates at layer i from 0 to $S_i - 1$, and let $V_i : \{0,1\}^{s_i} \to \mathbb{F}$ denote the function that
takes as input a binary gate label, and outputs the corresponding gate's value at layer i. The GKR protocol
makes use of the multilinear extension $\tilde{V}_i$ of the function $V_i$ (see Section 2.1.3).

The GKR protocol also makes use of the notion of a "wiring predicate" that encodes which pairs of 
wires from layer i + 1 are connected to a given gate at layer i in C. We define two functions, $\mathrm{add}_i$ and $\mathrm{mult}_i$,
mapping $\{0,1\}^{s_i + 2s_{i+1}}$ to $\{0,1\}$, which together constitute the wiring predicate of layer i of C. Specifically,
these functions take as input three gate labels $(j_1, j_2, j_3)$, and return 1 if gate $j_1$ at layer i is the addition
(respectively, multiplication) of gates $j_2$ and $j_3$ at layer i + 1, and return 0 otherwise. Let $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$
denote the multilinear extensions of $\mathrm{add}_i$ and $\mathrm{mult}_i$ respectively.

Finally, let $\beta_{s_i}(z, p)$ denote the function

$$\beta_{s_i}(z, p) = \prod_{j=1}^{s_i} \left( (1 - z_j)(1 - p_j) + z_j p_j \right).$$

It is straightforward to check that $\beta_{s_i}$ is the multilinear extension of the function $B(x, y) : \{0,1\}^{s_i} \times \{0,1\}^{s_i} \to \{0,1\}$
that evaluates to 1 if x = y, and evaluates to 0 otherwise.
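To make the role of $\beta_{s_i}$ concrete, the following Python sketch evaluates it directly from the product formula and uses it to evaluate a multilinear extension by brute force. It is a sketch under assumptions: the modulus `Q`, the function names, and the convention that table index bits are read high-order first are illustrative choices, and the brute-force `mle` takes time proportional to $s \cdot 2^s$ rather than using any of the faster methods discussed later.

```python
import itertools

Q = 2**61 - 1  # illustrative field modulus (a Mersenne prime)


def beta(z, p, q=Q):
    """beta_s(z, p) = prod_j ((1 - z_j)(1 - p_j) + z_j * p_j) over the field of size q."""
    out = 1
    for zj, pj in zip(z, p):
        out = out * ((1 - zj) * (1 - pj) + zj * pj) % q
    return out


def mle(values, z, q=Q):
    """Evaluate at z the multilinear extension of a table of 2^s values.

    values[idx] is the value at the Boolean point whose bits, high-order first,
    are the binary representation of idx.
    """
    s = len(z)
    total = 0
    for idx, bits in enumerate(itertools.product((0, 1), repeat=s)):
        total = (total + values[idx] * beta(z, bits, q)) % q
    return total
```

On Boolean inputs `beta` is the equality indicator, so `mle` returns the table entry itself; at a non-Boolean point such as `(2, 3)` it returns the unique multilinear interpolation of the table.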

4.3.2 Protocol Outline 

The GKR protocol consists of d(n) iterations, one for each layer of the circuit. Each iteration starts with 
$\mathcal{P}$ claiming a value for $\tilde{V}_i(z)$ for some point $z \in \mathbb{F}^{s_i}$. In the case of iteration one and circuits with a
single output gate, $z = 0$ and $\tilde{V}_1(0)$ corresponds to the output value of the circuit.

In the case of iteration one and circuits with many output gates, Vu et al. [40] show that $\mathcal{P}$ may simply
send $\mathcal{V}$ the (claimed) values of all output gates, thereby specifying a function $V_1' : \{0,1\}^{s_1} \to \mathbb{F}$ claimed to
equal $V_1$. $\mathcal{V}$ can pick a random point $z \in \mathbb{F}^{s_1}$ and evaluate $\tilde{V}_1'(z)$ on her own in $O(S_1)$ time (see Remark 1 in
Section 4.3.5). A simple application of the Schwartz-Zippel Lemma (Lemma 1) implies that it is safe for $\mathcal{V}$
to believe that $V_1'$ indeed equals $V_1$ as claimed, as long as $\tilde{V}_1(z) = \tilde{V}_1'(z)$.

The purpose of iteration i is to reduce the claim about the value of $\tilde{V}_i(z)$ to a claim about $\tilde{V}_{i+1}(\omega)$ for
some $\omega \in \mathbb{F}^{s_{i+1}}$, in the sense that it is safe for $\mathcal{V}$ to assume that the first claim is true as long as the second
claim is true. To accomplish this, the iteration applies the sum-check protocol described in Section 4.2 to a
specific polynomial derived from $\tilde{V}_{i+1}$, $\widetilde{\mathrm{add}}_i$, $\widetilde{\mathrm{mult}}_i$, and $\beta_{s_i}$.

4.3.3 Details for Each Iteration 

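Applying the Sum-Check Protocol. It can be shown that for any $z \in \mathbb{F}^{s_i}$,

$$\tilde{V}_i(z) = \sum_{(p, \omega_1, \omega_2) \in \{0,1\}^{s_i + 2s_{i+1}}} f^{(i)}(p, \omega_1, \omega_2),$$

where

$$f^{(i)}(p, \omega_1, \omega_2) = \beta_{s_i}(z, p) \cdot \left( \widetilde{\mathrm{add}}_i(p, \omega_1, \omega_2) \left( \tilde{V}_{i+1}(\omega_1) + \tilde{V}_{i+1}(\omega_2) \right) + \widetilde{\mathrm{mult}}_i(p, \omega_1, \omega_2) \, \tilde{V}_{i+1}(\omega_1) \cdot \tilde{V}_{i+1}(\omega_2) \right). \tag{3}$$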

Iteration i therefore applies the sum-check protocol of Section 4.2 to the polynomial $f^{(i)}$. There remains
the issue that $\mathcal{V}$ can only execute her part of the sum-check protocol if she can evaluate the polynomial $f^{(i)}$
at a random point $(r_1, \ldots, r_{s_i + 2s_{i+1}})$. This is handled as follows.

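Let $p^*$ denote the first $s_i$ entries of the vector $(r_1, \ldots, r_{s_i + 2s_{i+1}})$, $\omega_1^*$ the next $s_{i+1}$ entries, and $\omega_2^*$ the last
$s_{i+1}$ entries. Evaluating $f^{(i)}(p^*, \omega_1^*, \omega_2^*)$ requires evaluating $\beta_{s_i}(z, p^*)$, $\widetilde{\mathrm{add}}_i(p^*, \omega_1^*, \omega_2^*)$, $\widetilde{\mathrm{mult}}_i(p^*, \omega_1^*, \omega_2^*)$,
$\tilde{V}_{i+1}(\omega_1^*)$, and $\tilde{V}_{i+1}(\omega_2^*)$.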

$\mathcal{V}$ can easily evaluate $\beta_{s_i}(z, p^*)$ in $O(s_i)$ time. For many circuits, particularly those with "regular" wiring
patterns, $\mathcal{V}$ can evaluate $\widetilde{\mathrm{add}}_i(p^*, \omega_1^*, \omega_2^*)$ and $\widetilde{\mathrm{mult}}_i(p^*, \omega_1^*, \omega_2^*)$ on her own in $\mathrm{poly}(s_i, s_{i+1})$ time as well.

$\mathcal{V}$ cannot however evaluate $\tilde{V}_{i+1}(\omega_1^*)$ and $\tilde{V}_{i+1}(\omega_2^*)$ on her own without evaluating the circuit. Instead, $\mathcal{V}$
asks $\mathcal{P}$ to simply tell her these two values, and uses iteration i + 1 to verify that these values are as claimed.
However, one complication remains: the precondition for iteration i + 1 is that $\mathcal{P}$ claims a value for $\tilde{V}_{i+1}(z)$ for
a single $z \in \mathbb{F}^{s_{i+1}}$. So $\mathcal{V}$ needs to reduce verifying both $\tilde{V}_{i+1}(\omega_1^*)$ and $\tilde{V}_{i+1}(\omega_2^*)$ to verifying $\tilde{V}_{i+1}(\omega^*)$ at a single
point $\omega^* \in \mathbb{F}^{s_{i+1}}$, in the sense that it is safe for $\mathcal{V}$ to accept the claimed values of $\tilde{V}_{i+1}(\omega_1^*)$ and $\tilde{V}_{i+1}(\omega_2^*)$ as
long as the value of $\tilde{V}_{i+1}(\omega^*)$ is as claimed. This is done as follows.

Reducing to Verification of a Single Point. Let $\ell : \mathbb{F} \to \mathbb{F}^{s_{i+1}}$ be some canonical line passing through $\omega_1^*$
and $\omega_2^*$. For example, we can let $\ell$ be the unique line such that $\ell(0) = \omega_1^*$ and $\ell(1) = \omega_2^*$. $\mathcal{P}$
sends a degree-$s_{i+1}$ polynomial h claimed to be $\tilde{V}_{i+1} \circ \ell$, the restriction of $\tilde{V}_{i+1}$ to the line $\ell$. $\mathcal{V}$ checks that
h(0) and h(1) equal the claimed values of $\tilde{V}_{i+1}(\omega_1^*)$ and $\tilde{V}_{i+1}(\omega_2^*)$ respectively (rejecting if this is not the case),
picks a random point $r^* \in \mathbb{F}$, and asks $\mathcal{P}$ to
prove that $\tilde{V}_{i+1}(\ell(r^*)) = h(r^*)$. By the Schwartz-Zippel Lemma (Lemma 1), as long as $\mathcal{V}$ is convinced that
$\tilde{V}_{i+1}(\ell(r^*)) = h(r^*)$, it is safe for $\mathcal{V}$ to believe that the values of $\tilde{V}_{i+1}(\omega_1^*)$ and $\tilde{V}_{i+1}(\omega_2^*)$ are as claimed by $\mathcal{P}$.
This completes iteration i; $\mathcal{P}$ and $\mathcal{V}$ then move on to the iteration for layer i + 1 of the circuit, whose purpose
is to verify that $\tilde{V}_{i+1}(\ell(r^*))$ has the claimed value.
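The reduction from two points to one can be illustrated with a small Python sketch. Everything named here (`Q`, `mle`, `line`, `interpolate`, `reduce_to_one_point`) is a hypothetical helper introduced for illustration, the prover is simulated honestly inside the same function, and the multilinear extension is evaluated by brute force, so this shows only the logic of the step and not its efficient implementation.

```python
import itertools
import random

Q = 2**61 - 1  # illustrative field modulus


def mle(values, x):
    """Brute-force evaluation of the multilinear extension of a table of 2^s values."""
    total = 0
    for idx, bits in enumerate(itertools.product((0, 1), repeat=len(x))):
        w = 1
        for xj, bj in zip(x, bits):
            w = w * (xj if bj else 1 - xj) % Q
        total = (total + values[idx] * w) % Q
    return total


def line(w1, w2, t):
    """The point l(t) on the line with l(0) = w1 and l(1) = w2."""
    return tuple((a + t * (b - a)) % Q for a, b in zip(w1, w2))


def interpolate(ys, x):
    """Evaluate at x the polynomial of degree < len(ys) with value ys[t] at t = 0, 1, ..."""
    total = 0
    for i, yi in enumerate(ys):
        num, den = 1, 1
        for k in range(len(ys)):
            if k != i:
                num = num * (x - k) % Q
                den = den * (i - k) % Q
        total = (total + yi * num * pow(den, -1, Q)) % Q
    return total


def reduce_to_one_point(values, w1, w2, claim1, claim2, rng=random):
    """Return (new_point, new_claim), or None if the verifier rejects."""
    s = len(w1)
    # Honest prover: h has degree at most s, so s + 1 evaluations specify it.
    h = [mle(values, line(w1, w2, t)) for t in range(s + 1)]
    # Verifier: h(0) and h(1) must match the two claimed values.
    if h[0] != claim1 % Q or h[1] != claim2 % Q:
        return None
    r = rng.randrange(Q)
    return line(w1, w2, r), interpolate(h, r)
```

The returned pair is the single claim that the next iteration must verify: the extension evaluated at `new_point` should equal `new_claim`.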

The Final Iteration. Finally, at the final iteration d, $\mathcal{V}$ must evaluate $\tilde{V}_d(\omega^*)$ on her own. But the vector of
gate values at layer d of C is simply the input x to C. It can be shown that $\mathcal{V}$ can compute $\tilde{V}_d(\omega^*)$ on her
own in $O(n \log n)$ time, with a single streaming pass over the input [15]. Moreover, Vu et al. show how to
bring $\mathcal{V}$'s time cost down to $O(n)$ [40], but this methodology does not work in a general streaming model.
For completeness, we present details of both of these observations in Section 4.3.5.



4.3.4 Discussion of Costs. 

Observe that the polynomial $f^{(i)}$ defined in Equation (3) is an $(s_i + 2s_{i+1})$-variate polynomial of degree at most
2 in each variable, and so the invocation of the sum-check protocol at iteration i requires $s_i + 2s_{i+1}$ rounds,
with three field elements transmitted per round. Thus, the total communication cost is $O(d(n) \log S(n))$ field
elements, where d(n) is the depth of the circuit C. The time cost to $\mathcal{V}$ is $O(n \log n + d(n) \log S(n))$, where
the $n \log n$ term is due to the time required to evaluate $\tilde{V}_d(\omega^*)$ (see Lemma 2 below), and the $d(n) \log S(n)$
term is the time required for $\mathcal{V}$ to send messages to $\mathcal{P}$ and process and check the messages from $\mathcal{P}$.

As for $\mathcal{P}$'s runtime, for any iteration i of the GKR protocol, a naive implementation of the prover in the
corresponding instance of the sum-check protocol would require time $\Omega(2^{s_i + 2s_{i+1}})$, as the sum defining each
of $\mathcal{P}$'s messages is over as many as $2^{s_i + 2s_{i+1}}$ terms. This cost can be $\Omega(S(n)^3)$, which is prohibitively large
in practice. However, Cormode, Mitzenmacher, and Thaler showed in [14] that each gate at layers i and
i + 1 of C contributes to only a single term of the sum, and exploit this to bring the runtime of $\mathcal{P}$ down to
$O(S(n) \log S(n))$.

Footnote (on the assumption in Section 4.3.3 that $\mathcal{V}$ can evaluate $\widetilde{\mathrm{add}}_i$ and $\widetilde{\mathrm{mult}}_i$ efficiently): Various
suggestions have been put forth for what to do if this is not the case. For example, these computations can always
be done by $\mathcal{V}$ in $O(\log S(n))$ space as long as the circuit is log-space uniform, which is sufficient in streaming applications where
the space usage of the verifier is paramount [14]. Moreover, these computations can be done offline before the input is even
observed, because they only depend on the wiring of the circuit, and not on the input [14, 21]. Finally, [40] notes that the cost
of this computation can be effectively amortized in a batching model, where many identical computations on different inputs are
verified simultaneously. See Section 7 for further discussion, and a protocol that mitigates this issue in the context of data parallel
computation.

4.3.5 Making $\mathcal{V}$ Fast vs. Making $\mathcal{V}$ Streaming

We describe how $\mathcal{V}$ can efficiently evaluate $\tilde{V}_d(\omega^*)$ on her own, as required in the final iteration of the GKR
protocol. Prior work has identified two methods for performing this computation. The first method is due
to Cormode, Thaler, and Yi [15]. It requires $O(n \log n)$ time, and allows $\mathcal{V}$ to make a single streaming pass
over the input using $O(\log n)$ space.

Lemma 2 ([15]) $\mathcal{V}$ can compute $\tilde{V}_d(\omega^*)$ in $O(n \log n)$ time and $O(\log n)$ space with a single streaming pass
over the input.

Proof: We exploit the following explicit expression for $\tilde{V}_d$. For a vector $b \in \{0,1\}^{\log n}$ let
$\chi_b(x_1, \ldots, x_{\log n}) = \prod_{k=1}^{\log n} \chi_{b_k}(x_k)$, where $\chi_0(x_k) = 1 - x_k$ and $\chi_1(x_k) = x_k$. Notice that $\chi_b$ is the unique multilinear polynomial
that takes $b \in \{0,1\}^{\log n}$ to 1 and all other values in $\{0,1\}^{\log n}$ to 0, i.e., it is the multilinear extension of the
indicator function for Boolean vector b. With this definition in hand, we may write:

$$\tilde{V}_d(p_1, \ldots, p_{\log n}) = \sum_{b \in \{0,1\}^{\log n}} V_d(b) \, \chi_b(p_1, \ldots, p_{\log n}). \tag{4}$$

In particular, by letting $(p_1, \ldots, p_{\log n}) = \omega^*$ in Equation (4), we see that

$$\tilde{V}_d(\omega^*) = \sum_{b \in \{0,1\}^{\log n}} V_d(b) \, \chi_b(\omega^*). \tag{5}$$

Given any stream update $(i, \delta)$, let $(i_1, \ldots, i_{\log n})$ denote the binary representation of i. Notice that
update $(i, \delta)$ has the effect of increasing $V_d(i_1, \ldots, i_{\log n})$ by $\delta$, and does not affect $V_d(x_1, \ldots, x_{\log n})$ for any
$(x_1, \ldots, x_{\log n}) \ne (i_1, \ldots, i_{\log n})$. Thus, $\mathcal{V}$ can compute $\tilde{V}_d(\omega^*)$ incrementally from the raw stream by
initializing $\tilde{V}_d(\omega^*) \leftarrow 0$, and processing each update $(i, \delta)$ via:

$$\tilde{V}_d(\omega^*) \leftarrow \tilde{V}_d(\omega^*) + \delta \, \chi_i(\omega^*).$$

$\mathcal{V}$ only needs to store $\tilde{V}_d(\omega^*)$ and $\omega^*$, which requires $O(\log n)$ words of memory. Moreover, for any i,
$\chi_{(i_1, \ldots, i_{\log n})}(\omega^*)$ can be computed in $O(\log n)$ field operations, and thus $\mathcal{V}$ can compute $\tilde{V}_d(\omega^*)$ with one pass
over the raw stream, using $O(\log n)$ words of space and $O(\log n)$ field operations per update. ■
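The incremental update in the proof above can be sketched as follows. This is a minimal illustration, assuming a hypothetical modulus `Q`, hypothetical function names, and the convention that the first coordinate of `omega` corresponds to the high-order bit of the index; updates are given as `(index, delta)` pairs in arbitrary order.

```python
Q = 2**61 - 1  # illustrative field modulus


def chi(i, omega, q=Q):
    """chi_i(omega): product over bits of i of omega_k (bit 1) or 1 - omega_k (bit 0)."""
    logn = len(omega)
    out = 1
    for k in range(logn):
        bit = (i >> (logn - 1 - k)) & 1   # k-th bit of i, high-order first
        out = out * (omega[k] if bit else 1 - omega[k]) % q
    return out


def stream_mle(updates, omega, q=Q):
    """Single pass over (index, delta) updates; stores only the running value and omega."""
    acc = 0
    for i, delta in updates:
        acc = (acc + delta * chi(i, omega, q)) % q
    return acc
```

Because each update is processed independently, the result does not depend on the order of the stream or on whether an entry's total is delivered in one update or several.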



The second method is due to Vu et al. [40]. It enables $\mathcal{V}$ to compute $\tilde{V}_d(\omega^*)$ in $O(n)$ time, but requires
$\mathcal{V}$ to use $O(n)$ space.

Lemma 3 (Vu et al. [40]) $\mathcal{V}$ can compute $\tilde{V}_d(\omega^*)$ in $O(n)$ time and $O(n)$ space.

Proof: We again exploit the expression for $\tilde{V}_d(\omega^*)$ in Equation (5). Notice the right hand side of Equation
(5) expresses $\tilde{V}_d(\omega^*)$ as the inner product of two n-dimensional vectors, where the bth entry of the first
vector is $V_d(b)$ and the bth entry of the second vector is $\chi_b(\omega^*)$. This inner product can be computed in
$O(n)$ time given a table of size n whose bth entry contains the quantity $\chi_b(\omega^*)$. Vu et al. show how to build
such a table in time $O(n)$ using memoization.

The memoization procedure consists of $\log n$ stages, where Stage j constructs a table $A^{(j)}$ of size
$2^j$, such that for any $(b_1, \ldots, b_j) \in \{0,1\}^j$, $A^{(j)}[(b_1, \ldots, b_j)] = \prod_{i=1}^{j} \chi_{b_i}(\omega_i^*)$. Notice
$A^{(j)}[(b_1, \ldots, b_j)] = A^{(j-1)}[(b_1, \ldots, b_{j-1})] \cdot \chi_{b_j}(\omega_j^*)$, and so the jth stage of the memoization procedure requires time $O(2^j)$. The
total time across all $\log n$ stages is therefore $O(\sum_{j=1}^{\log n} 2^j) = O(2^{\log n}) = O(n)$. This completes the proof. ■
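The staged table construction can be sketched in Python as below. The names `chi_table` and `mle_via_table` and the modulus `Q` are illustrative assumptions; the sketch follows the recurrence from the proof, extending each entry of the previous stage by the two possible values of the next bit.

```python
Q = 2**61 - 1  # illustrative field modulus


def chi_table(omega, q=Q):
    """Return the table [chi_b(omega) for b in {0,1}^len(omega)], index bits high-order first.

    Stage j doubles the table, doing O(2^j) multiplications, for O(n) work in total.
    """
    table = [1]
    for w in omega:
        nxt = []
        for entry in table:
            nxt.append(entry * (1 - w) % q)   # next bit is 0
            nxt.append(entry * w % q)         # next bit is 1
        table = nxt
    return table


def mle_via_table(values, omega, q=Q):
    """Inner product of the input values with the chi table, as in Equation (5)."""
    return sum(v * t for v, t in zip(values, chi_table(omega, q))) % q
```

Compared with the streaming method, this trades $O(n)$ space for the removal of the logarithmic factor in time.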



Remark 1 In [41], Vu et al. further observe that if the input is presented in a specific order, then $\mathcal{V}$ can
evaluate $\tilde{V}_d(\omega^*)$ using $O(\log n)$ space. Compare this result to Lemma 2, which requires $O(n \log n)$ time for
$\mathcal{V}$, but allows $\mathcal{V}$ to use $O(\log n)$ space regardless of the order in which the input is presented.

5 Time-Optimal Protocols for Circuit Evaluation 

5.1 Protocol Outline and Section Roadmap 

As with the GKR protocol, our protocol consists of d(n) iterations, one for each layer of the circuit. Each 
iteration starts with $\mathcal{P}$ claiming a value for $\tilde{V}_i(z)$ for some value $z \in \mathbb{F}^{s_i}$. The purpose of the iteration is to
reduce this claim to a claim about $\tilde{V}_{i+1}(\omega)$ for some $\omega \in \mathbb{F}^{s_{i+1}}$, in the sense that it is safe for $\mathcal{V}$ to assume that
the first claim is true as long as the second claim is true. As in the GKR protocol, this is done by invoking 
the sum-check protocol on a certain polynomial. 

In order to improve on the costs of the GKR protocol implementation of Cormode et al. [14], we replace
the polynomial $f^{(i)}$ in Equation (3) with a different polynomial $g^{(i)}$ defined over a much smaller domain.
Specifically, $g^{(i)}$ is defined over only $s_i$ variables rather than $s_i + 2s_{i+1}$ variables as is the case of $f^{(i)}$. Using
$g^{(i)}$ in place of $f^{(i)}$ allows $\mathcal{P}$ to reuse work across iterations of the sum-check protocol, thereby reducing $\mathcal{P}$'s
runtime by a logarithmic factor relative to [14], as formalized in Theorem 1 below.

The remainder of the presentation leading up to Theorem 1 proceeds as follows. After stating a
preliminary lemma, we describe the polynomial $g^{(i)}$ that we use in the context of three specific circuits: a
binary tree of addition or multiplication gates, and a circuit computing the number of non-zero entries of
an n-dimensional vector a. The purpose of this exposition is to showcase the ideas underlying Theorem 1 in
concrete scenarios. Second, we explain the algorithmic insights that allow $\mathcal{P}$ to reuse work across iterations
of the sum-check protocol applied to $g^{(i)}$. Finally, we state and prove Theorem 1, which formalizes the class
of circuits to which our methods apply.

5.2 A Preliminary Lemma 

We will repeatedly invoke the following lemma, which allows us to express the value $\tilde{V}_i(z)$ in a manner
amenable to verification via the sum-check protocol. This is essentially a restatement of pT] Lemma 3.2.1]. 

Lemma 4 Let W be any polynomial $\mathbb{F}^{s_i} \to \mathbb{F}$ that extends $V_i$, in the sense that for all $p \in \{0,1\}^{s_i}$, $W(p) = V_i(p)$.
Then for any $z \in \mathbb{F}^{s_i}$,

$$\tilde{V}_i(z) = \sum_{p \in \{0,1\}^{s_i}} \beta_{s_i}(z, p) \, W(p). \tag{6}$$

Proof: It is easy to check that the right hand side of Equation (6) is a multilinear polynomial in z, and that it
agrees with $V_i$ on all Boolean inputs. Thus, the right hand side of Equation (6), viewed as a polynomial in z,
must be the multilinear extension $\tilde{V}_i$ of $V_i$. This completes the proof. ■

5.3 Polynomials for Specific Circuits

5.3.1 The Polynomial for a Binary Tree 

Consider a circuit C that computes the product of all n of its inputs by multiplying them together via a
binary tree. Label the gates at layers i and i + 1 in the natural way, so that the first input to the gate
labelled $p = (p_1, \ldots, p_{s_i}) \in \{0,1\}^{s_i}$ at layer i is the gate with label (p, 0) at layer i + 1, and the second input
to gate p has label (p, 1). Here and throughout, (p, 0) denotes the $(s_i + 1)$-dimensional vector obtained by
concatenating the entry 0 to the end of the vector p. Interpreting $p = (p_1, \ldots, p_{s_i}) \in \{0,1\}^{s_i}$ as an integer
between 0 and $2^{s_i} - 1$ with $p_1$ as the high-order bit and $p_{s_i}$ as the low-order bit, this says that the first
in-neighbor of p is 2p and the second is 2p + 1. It follows immediately that for any gate $p \in \{0,1\}^{s_i}$ at layer i,
$V_i(p) = V_{i+1}(p, 0) \cdot V_{i+1}(p, 1)$. Invoking Lemma 4, we obtain the following proposition.

Proposition 2 Let C be a circuit consisting of a binary tree of multiplication gates. Then $\tilde{V}_i(z) = \sum_{p \in \{0,1\}^{s_i}} g^{(i)}(p)$,
where $g^{(i)}(p) = \beta_{s_i}(z, p) \cdot \tilde{V}_{i+1}(p, 0) \cdot \tilde{V}_{i+1}(p, 1)$.
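As a sanity check on the identity for a binary tree of multiplication gates, the following Python sketch builds one layer of such a tree on a toy input and compares both sides at a non-Boolean point z. The modulus `Q`, the helper names, the gate values, and the point `z` are all illustrative assumptions; the multilinear extensions are evaluated by brute force.

```python
import itertools

Q = 2**61 - 1  # illustrative field modulus


def beta(z, p):
    """beta(z, p) = prod_j ((1 - z_j)(1 - p_j) + z_j * p_j) mod Q."""
    out = 1
    for zj, pj in zip(z, p):
        out = out * ((1 - zj) * (1 - pj) + zj * pj) % Q
    return out


def mle(values, x):
    """Brute-force multilinear extension of a table of 2^s values (index bits high-order first)."""
    total = 0
    for idx, bits in enumerate(itertools.product((0, 1), repeat=len(x))):
        total = (total + values[idx] * beta(x, bits)) % Q
    return total


def g_layer(z, next_vals):
    """g(p) = beta(z, p) * V_{i+1}(p, 0) * V_{i+1}(p, 1) for a multiplication tree."""
    def g(p):
        p = tuple(p)
        return beta(z, p) * mle(next_vals, p + (0,)) % Q * mle(next_vals, p + (1,)) % Q
    return g


# Toy layer i+1 with 8 gates; gate p at layer i multiplies gates 2p and 2p+1.
next_vals = [2, 3, 5, 7, 11, 13, 17, 19]
cur_vals = [next_vals[2 * p] * next_vals[2 * p + 1] % Q for p in range(4)]

z = (123456789, 987654321)                 # an arbitrary non-Boolean point
g = g_layer(z, next_vals)
lhs = mle(cur_vals, z)                                              # extension of layer i at z
rhs = sum(g(p) for p in itertools.product((0, 1), repeat=2)) % Q    # sum of g over {0,1}^{s_i}
```

Here `lhs` and `rhs` coincide, and `g` can also be evaluated at non-Boolean `p`, which is what the sum-check prover needs in later rounds.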

Remark 2 Notice that the polynomial $g^{(i)}$ in Proposition 2 is a degree three polynomial in each variable
of p. When applying the sum-check protocol to $g^{(i)}$, the prover therefore needs to send 4 field elements per
round.

In the case of Proposition 2, the line $\ell$ in the "Reducing to Verification of a Single Point" step has an
especially simple expression: $\ell(t) = (r, t)$, where $r \in \mathbb{F}^{s_i}$ is the vector of random field elements chosen by $\mathcal{V}$
over the execution of the sum-check protocol. In this case, $\tilde{V}_{i+1} \circ \ell$ has degree 1, and is implicitly specified
when $\mathcal{P}$ sends the claimed values of $\tilde{V}_{i+1}(r, 0)$ and $\tilde{V}_{i+1}(r, 1)$. This does not affect the asymptotic costs of the
protocol but does slightly simplify the implementation.

The case of a binary tree of addition gates is similar to the case of multiplication gates. 

Proposition 3 Let C be a circuit consisting of a binary tree of addition gates. Then $\tilde{V}_i(z) = \sum_{p \in \{0,1\}^{s_i}} g^{(i)}(p)$,
where $g^{(i)}(p) = \beta_{s_i}(z, p) \left( \tilde{V}_{i+1}(p, 0) + \tilde{V}_{i+1}(p, 1) \right)$.

Remark 3 The polynomial $g^{(i)}$ of Proposition 3 has degree 2 in all variables, rather than degree 3 as in
Proposition 2.

5.3.2 The Polynomials for DISTINCT

We now describe a circuit C for computing the number of non-zero entries of a vector $a \in \mathbb{F}^n$ (this vector
should be interpreted as the frequency vector of a data stream). A similar circuit was used in conjunction
with the GKR protocol in [14] to yield an efficient protocol with a streaming verifier for DISTINCT, and we
borrow heavily from the presentation there. We remark that our refinements enable us to slightly simplify
the circuit used in [14] by avoiding the awkward use of a constant-valued input wire with value set to 1.
This causes some gates in our circuit to have fan-in 1 rather than fan-in 2, which is easily supported by our
protocol.

The circuit C is tailored for use over the field of cardinality equal to a Mersenne prime $q = 2^k - 1$
for some k. Fields of cardinality equal to a Mersenne prime can support extremely fast arithmetic, and as
discussed later in Section 6.2, there are several Mersenne primes of appropriate magnitude for use within
our protocols.

The circuit C exploits Fermat's Little Theorem, computing $a_i^{q-1}$ for each input entry $a_i$ before summing
the results. As described in [14], verifying the summation sub-circuit can be handled with a single invocation
of the sum-check protocol, or less efficiently by running our protocol for a binary tree of addition gates
described in Proposition 3.

Figure 1: The first several layers of a circuit for $F_0$ on four inputs over the field $\mathbb{F}_q$ with $q = 2^k - 1$ elements.
The first layer from the bottom computes $a_i^2$ for each input entry $a_i$. The second layer from the bottom
computes $a_i^4$ and $a_i^2$ for all i. The third layer computes $a_i^8$ and $a_i^6 = a_i^4 \times a_i^2$, while the fourth layer computes
$a_i^{16}$ and $a_i^{14} = a_i^8 \times a_i^6$. The remaining layers (not shown) have structure identical to the third and fourth
layers until the value $a_i^{q-1}$ is computed for all i, and the circuit culminates in a binary tree of addition gates.

We now turn to describing the part of the circuit computing $a_i^{q-1}$ for each input entry $a_i$. We may write
$q - 1 = 2^k - 2$, whose binary representation is $k - 1$ 1s followed by a 0. Thus, $a_i^{q-1} = \prod_{j=1}^{k-1} a_i^{2^j}$. To compute
$a_i^{q-1}$, the circuit repeatedly squares $a_i$ and multiplies together the results "as it goes". In more detail, for
$j > 2$ there are two multiplication gates at each layer $d(n) - j$ of the circuit for computing $a_i^{q-1}$; the first
computes $a_i^{2^j}$ by squaring the corresponding gate at layer $j - 1$, and the second computes $\prod_{\ell=1}^{j-1} a_i^{2^\ell}$. See
Figure 1 for a depiction.

For our purposes there are $k + 1$ relevant circuit layers, all of which consist entirely of multiplication
gates. Layers 1 through $k - 1$ all contain $2n$ gates. Number the gates from 0 to $2n - 1$ in the natural way.
In what follows, we will abuse notation and use p to refer to both a gate number as well as its binary 
representation. 

An even-numbered gate p at layer i has both in-wires connected to gate p at layer i + 1, while an
odd-numbered gate p has one in-wire connected to gate p and another connected to gate p - 1. Thus, the
connectivity information of the circuit is a simple function of the binary representation p of each gate at
layer i. If the low-order bit $p_{s_i}$ of p is 0 (i.e. it is an even-numbered gate), then both in-neighbors at layer
i + 1 of gate p have binary representation p. If the low-order bit $p_{s_i}$ is 1 (i.e. it is an odd-numbered gate),
then the first in-neighbor of gate p has binary representation p, and the second has binary representation
$(p_{-s_i}, 0)$, where $p_{-s_i}$ denotes p with the coordinate $p_{s_i}$ removed.

Invoking Lemma 4, the following proposition is easily verified.

Proposition 4 Let C be the circuit described above. For layers $i \in \{1, \ldots, k - 1\}$, $\tilde{V}_i(z) = \sum_{p \in \{0,1\}^{s_i}} g^{(i)}(p)$
where

$$g^{(i)}(p) = \beta_{s_i}(z, p) \cdot \left( (1 - p_{s_i}) \, \tilde{V}_{i+1}(p_{-s_i}, 0) \cdot \tilde{V}_{i+1}(p_{-s_i}, 0) + p_{s_i} \, \tilde{V}_{i+1}(p_{-s_i}, 1) \cdot \tilde{V}_{i+1}(p_{-s_i}, 0) \right),$$

where $p_{-s_i}$ denotes p with the coordinate $p_{s_i}$ removed.

Remark 4 To check $\mathcal{P}$'s claim in the final round of the sum-check protocol applied to $g^{(i)}$, $\mathcal{V}$ needs to know
$\tilde{V}_{i+1}(r, 0)$ and $\tilde{V}_{i+1}(r, 1)$ for some random vector $r \in \mathbb{F}^{s_i - 1}$. This is identical to the situation in the case of a
binary tree of addition or multiplication gates, where the "Reducing to Verification of a Single Point" step
had an especially simple implementation.

At layer k, an even-numbered gate p has both in-wires connected to gate p/2 at layer k + 1, while an
odd-numbered gate p has its unique in-wire connected to gate (p - 1)/2 at layer k + 1. Thus, for a gate at
layer i = k, if the low-order bit $p_{s_i}$ of the gate's binary representation p is 0 (i.e. it is an even-numbered
gate), then both in-neighbors of p at layer i + 1 have binary representation $p_{-s_i}$. If the low-order bit $p_{s_i}$ is 1
(i.e. it is an odd-numbered gate), then the unique in-neighbor of p at layer i + 1 has binary representation
$p_{-s_i}$.

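Invoking Lemma 4, the following is easily verified.

Proposition 5 Let C be the circuit described above. For layer i = k, $\tilde{V}_i(z) = \sum_{p \in \{0,1\}^{s_i}} g^{(i)}(p)$ where

$$g^{(i)}(p) = \beta_{s_i}(z, p) \left( (1 - p_{s_i}) \, \tilde{V}_{i+1}(p_{-s_i}) \cdot \tilde{V}_{i+1}(p_{-s_i}) + p_{s_i} \, \tilde{V}_{i+1}(p_{-s_i}) \right),$$

where $p_{-s_i}$ denotes p with coordinate $p_{s_i}$ removed.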

Finally, at layer k + 1, each gate p has both in-wires connected to gate p at layer k + 2 (which is the input
layer). Thus:

Proposition 6 Let C be the circuit described above. For layer i = k + 1, $\tilde{V}_i(z) = \sum_{p \in \{0,1\}^{s_i}} g^{(i)}(p)$ where

$$g^{(i)}(p) = \beta_{s_i}(z, p) \, \tilde{V}_{i+1}(p) \cdot \tilde{V}_{i+1}(p).$$

5.4 Reusing Work 



Recall that our analysis of the costs of the sum-check protocol in Section 4.2.1 revealed that, when applying
a sum-check protocol to an $s_i$-variate polynomial $g^{(i)}$, $\mathcal{P}$ only needs to evaluate $g^{(i)}$ at $O(2^{s_i})$ points across
all rounds of the protocol. Our goal in this section is to show how $\mathcal{P}$ can do this in time $O(2^{s_i} + 2^{s_{i+1}}) =
O(S_i + S_{i+1})$ for all of the polynomials $g^{(i)}$ described in Section 5.3. This is sufficient to ensure that $\mathcal{P}$ takes
$O(\sum_{i=1}^{d(n)} S_i) = O(S(n))$ time across all iterations of our circuit-checking protocol.



To this end, notice that all of the polynomials $g^{(i)}$ described in Propositions 2-6 have the following
property: for any $r \in \mathbb{F}^{s_i}$, evaluating $g^{(i)}(r)$ can be done in constant time given $\beta_{s_i}(z, r)$ and the evaluations of $\tilde{V}_{i+1}$
at a constant number of points. For example, consider the polynomial $g^{(i)}$ described in Proposition 4: $g^{(i)}(r)$
can be computed in constant time given $\beta_{s_i}(z, r)$, $\tilde{V}_{i+1}(r_{-s_i}, 0)$, and $\tilde{V}_{i+1}(r_{-s_i}, 1)$.

Moreover, the points at which $\mathcal{P}$ must evaluate $g^{(i)}$ within the sum-check protocol are highly structured:
in round j of the sum-check protocol, the points are all of the form $(r_1, \ldots, r_{j-1}, t, b_{j+1}, \ldots, b_{s_i})$ with
$t \in \{0, 1, \ldots, \deg_j(g^{(i)})\}$ and $(b_{j+1}, \ldots, b_{s_i}) \in \{0,1\}^{s_i - j}$.

5.4.1 Computing the Necessary $\beta(z,p)$ Values

Pre-processing. We begin by explaining how $\mathcal{P}$ can, in $O(2^{s_i})$ time, compute an array $C^{(0)}$ of length $2^{s_i}$ of all
values $\beta(z,p) = \prod_{k=1}^{s_i}(p_k z_k + (1-p_k)(1-z_k))$ for $p \in \{0,1\}^{s_i}$. $\mathcal{P}$ can do this computation in preprocessing
before the sum-check protocol begins, as this computation does not depend on any of $\mathcal{V}$'s messages. Naively,
computing all entries of $C^{(0)}$ would require $O(s_i 2^{s_i})$ time, as there are $2^{s_i}$ values to compute, and each
involves $\Omega(s_i)$ multiplications. However, this can be improved using dynamic programming.
The dynamic programming algorithm proceeds in stages. In stage $j$, $\mathcal{P}$ computes an array $C^{(0,j)}$ of
length $2^j$. Abusing notation, we identify a number $p$ in $[2^j]$ with its binary representation in $\{0,1\}^j$. $\mathcal{P}$
computes

$$C^{(0,j)}[p] = \prod_{k=1}^{j}\left(p_k z_k + (1-p_k)(1-z_k)\right)$$

via the recurrence

$$C^{(0,j)}[(p_1,\dots,p_j)] = C^{(0,j-1)}[(p_1,\dots,p_{j-1})] \cdot \left(p_j z_j + (1-p_j)(1-z_j)\right).$$

Clearly $C^{(0,s_i)}$ equals the desired array $C^{(0)}$, and the total number of multiplications required over the entire
procedure is $O\left(\sum_{j=1}^{s_i} 2^j\right) = O(2^{s_i})$. We remark that our dynamic programming procedure is similar to the
method used by Vu et al. to reduce the verifier's runtime in the GKR protocol from $O(n \log n)$ to $O(n)$ in
Lemma 3.
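
The stage-by-stage construction of $C^{(0)}$ can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the modulus `P`, the function names `beta_table` and `beta_naive`, and the convention that $p_1$ is the most significant bit of the array index are all assumptions made for this sketch.

```python
P = 2**61 - 1  # assumed field modulus (a Mersenne prime)


def beta_table(z):
    """Return C0 with C0[p] = prod_k (p_k z_k + (1 - p_k)(1 - z_k)) mod P.

    Assumed index convention: p_1 is the most significant bit of the index.
    """
    table = [1]  # stage 0: the empty product
    for zj in z:
        nxt = []
        for prefix_val in table:
            # Stage j extends every (j-1)-bit prefix by one more bit p_j.
            nxt.append(prefix_val * (1 - zj) % P)  # p_j = 0
            nxt.append(prefix_val * zj % P)        # p_j = 1
        table = nxt
    return table


def beta_naive(z, bits):
    """Direct evaluation with len(z) multiplications, used only as a check."""
    out = 1
    for zk, pk in zip(z, bits):
        out = out * (pk * zk + (1 - pk) * (1 - zk)) % P
    return out
```

Stage $j$ performs $2^j$ multiplications, so the whole table costs a number of multiplications proportional to $2^{s_i}$, compared with roughly $s_i 2^{s_i}$ when calling `beta_naive` once per entry.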

Overview of Online Processing. In round $j$ of the sum-check protocol, $\mathcal{P}$ needs to evaluate the polynomial
$\beta(z,p)$ at $O(2^{s_i-j})$ points of the form $(r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_i})$ for $t \in [\deg_j(g^{(i)})]$ and $(b_{j+1},\dots,b_{s_i}) \in
\{0,1\}^{s_i-j}$. $\mathcal{P}$ will do this using the help of intermediate arrays $C^{(j)}$ defined as follows.
Define $C^{(j)}$ to be the array of length $2^{s_i-j}$ such that for $(p_{j+1},\dots,p_{s_i}) \in \{0,1\}^{s_i-j}$:

$$C^{(j)}[(p_{j+1},\dots,p_{s_i})] = \left(\prod_{k=1}^{j}\left(r_k z_k + (1-r_k)(1-z_k)\right)\right)\left(\prod_{k=j+1}^{s_i}\left(p_k z_k + (1-p_k)(1-z_k)\right)\right).$$

Efficiently Constructing $C^{(j)}$ Arrays. Inductively, assume $\mathcal{P}$ has computed the array $C^{(j-1)}$ in the previous
round. As the base case, we explained how $\mathcal{P}$ can evaluate $C^{(0)}$ in $O(2^{s_i})$ time in pre-processing. Now
observe that $\mathcal{P}$ can compute $C^{(j)}$ given $C^{(j-1)}$ in $O(2^{s_i-j})$ time using the following recurrence:

$$C^{(j)}[(p_{j+1},\dots,p_{s_i})] = z_j^{-1} \cdot C^{(j-1)}[(1,p_{j+1},\dots,p_{s_i})] \cdot \left(r_j z_j + (1-r_j)(1-z_j)\right). \qquad (7)$$

Remark 5 Equation (7) is only valid when $z_j \neq 0$. To avoid this issue, we can have $\mathcal{V}$ choose $z_j$ at random
from $\mathbb{F}^*$ rather than from $\mathbb{F}$, and this will affect the soundness probability by at most an additive
$O(d(n) \cdot \log S(n)/|\mathbb{F}|)$ factor.

Remark 6 Since computing multiplicative inverses in a finite field is not a constant-time operation, it is
important to note that $z_j^{-1}$ only needs to be computed once when determining the entries of $C^{(j)}$, i.e. it need
not be recomputed for each entry of $C^{(j)}$. Therefore, across all $s_i$ rounds of the sum-check protocol, only
$O(s_i)$ time in total is required to compute these multiplicative inverses, which does not affect the asymptotic
costs for $\mathcal{P}$. We discount the costs of computing $z_j^{-1}$ for the remainder of the discussion.

Thus, at the end of round $j$ of the sum-check protocol, when $\mathcal{V}$ sends $\mathcal{P}$ the value $r_j$, $\mathcal{P}$ can compute
$C^{(j)}$ from $C^{(j-1)}$ using Equation (7) in $O(2^{s_i-j})$ time.

Using the $C^{(j)}$ Arrays. Observe that given any point of the form $p = (r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_i})$ with
$(b_{j+1},\dots,b_{s_i}) \in \{0,1\}^{s_i-j}$, $\beta(z,p)$ can be evaluated in constant time using the array $C^{(j-1)}$, using the
equality

$$\beta(z,p) = C^{(j-1)}[(1,b_{j+1},\dots,b_{s_i})] \cdot z_j^{-1} \cdot \left(t z_j + (1-t)(1-z_j)\right).$$

As above, note that $z_j^{-1}$ can be computed just once and used for all points $p$, and this does not affect the
asymptotic costs for $\mathcal{P}$.

Putting Things Together. In round $j$ of the sum-check protocol, $\mathcal{P}$ uses the array $C^{(j-1)}$ to evaluate the
$O(2^{s_i-j})$ required $\beta(z,p)$ values in $O(2^{s_i-j})$ time. At the end of round $j$, $\mathcal{V}$ sends $\mathcal{P}$ the value $r_j$, and $\mathcal{P}$
computes $C^{(j)}$ from $C^{(j-1)}$ in $O(2^{s_i-j})$ time. In total across all rounds of the sum-check protocol, $\mathcal{P}$ spends
$O\left(\sum_{j=1}^{s_i} 2^{s_i-j}\right) = O(2^{s_i})$ time to compute the $\beta(z,p)$ values.
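
The per-round update of Equation (7) and the constant-time lookup of $\beta(z,p)$ can be sketched as below. This is only an illustration under stated assumptions: the modulus `P`, all function names, and the most-significant-bit-first indexing of the arrays are choices made for this sketch, and `beta` is a slow reference evaluator included solely to check the fast path.

```python
P = 2**61 - 1  # assumed field modulus


def beta(z, x):
    """Slow reference: beta(z, x) for any point x in F^s."""
    out = 1
    for zk, xk in zip(z, x):
        out = out * (xk * zk + (1 - xk) * (1 - zk)) % P
    return out


def initial_table(z):
    """C^(0): all beta(z, p) for Boolean p, with p_1 as the top index bit."""
    table = [1]
    for zk in z:
        table = [v * w % P for v in table for w in ((1 - zk) % P, zk % P)]
    return table


def fold(c_prev, zj, zj_inv, rj):
    """C^(j) from C^(j-1) via Equation (7); zj_inv is computed once per round."""
    factor = zj_inv * (rj * zj + (1 - rj) * (1 - zj)) % P
    half = len(c_prev) // 2
    # Entries whose leading free bit is 1 occupy the upper half of c_prev.
    return [c_prev[half + idx] * factor % P for idx in range(half)]


def beta_at(c_prev, zj, zj_inv, t, b_index):
    """beta(z, (r_1..r_{j-1}, t, b)) in O(1) time from C^(j-1)."""
    half = len(c_prev) // 2
    return c_prev[half + b_index] * zj_inv % P * ((t * zj + (1 - t) * (1 - zj)) % P) % P
```

A caller would compute `zj_inv = pow(zj, P - 2, P)` once at the start of round $j$ (valid because `P` is prime and $z_j \neq 0$), then reuse it for every `beta_at` call and for the final `fold`. Because each fold halves the array, the total work over all rounds is a geometric sum.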

5.4.2 Computing the Necessary $V_{i+1}(p)$ Values

For concreteness and clarity, we restrict our presentation within this subsection to the polynomial $g^{(i)}$
described in Proposition 4. Theorem 1 abstracts this analysis into a general result capturing a large class of
wiring patterns.

Recall that all of the polynomials $g^{(i)}$ described in Propositions 2-6 have the following property: for any
$p \in \mathbb{F}^{s_i}$, evaluating $g^{(i)}(p)$ can be done in constant time given $\beta(z,p)$ and the evaluations of $V_{i+1}$ at a constant
number of points. We have already shown how $\mathcal{P}$ can evaluate all of the necessary $\beta(z,p)$ values in $O(2^{s_i})$
time. It remains to show how $\mathcal{P}$ can evaluate all of the $V_{i+1}$ values in time $O(2^{s_i} + 2^{s_{i+1}})$. We remark that in
the context of Proposition 4, $s_i = s_{i+1}$; however, we still distinguish between these two quantities throughout
this subsection in order to ensure maximal consistency with the general derivation of Theorem 1.

Recall that the polynomial $g^{(i)}$ in Proposition 4 was defined as follows:

$$g^{(i)}(p) = \beta_{s_i}(z,p)\left((1-p_{s_i})\,V_{i+1}(p_{-s_i},0) \cdot V_{i+1}(p_{-s_i},0) + p_{s_i}\,V_{i+1}(p_{-s_i},1) \cdot V_{i+1}(p_{-s_i},0)\right).$$

In round $j$ of the sum-check protocol, $\mathcal{P}$ needs to evaluate $g^{(i)}$ at all points in the set

$$S^{(j)} = \left\{(r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_i}) : t \in \{0,\dots,\deg_j(g^{(i)})\},\ (b_{j+1},\dots,b_{s_i}) \in \{0,1\}^{s_i-j}\right\}.$$

By inspection of $g^{(i)}$, it suffices for $\mathcal{P}$ to evaluate $V_{i+1}$ at the same set of points. To show how to accomplish
this efficiently, we exploit the following explicit expression for $V_{i+1}$. This expression was derived for the
case $i+1 = d$ in Equation (4) within Lemma 2; we re-derive it here in the general case.

For a vector $b \in \{0,1\}^{s_{i+1}}$, let $\chi_b(x_1,\dots,x_{s_{i+1}}) = \prod_{k=1}^{s_{i+1}} \chi_{b_k}(x_k)$, where $\chi_0(x_k) = 1 - x_k$ and $\chi_1(x_k) = x_k$.

With this definition in hand, we may write:

$$V_{i+1}(p_1,\dots,p_{s_{i+1}}) = \sum_{b \in \{0,1\}^{s_{i+1}}} V_{i+1}(b)\,\chi_b(p_1,\dots,p_{s_{i+1}}). \qquad (8)$$

To see that Equation (8) holds, notice that the right hand side of Equation (8) is a multilinear polynomial
in the variables $(p_1,\dots,p_{s_{i+1}})$, and that it agrees with $V_{i+1}$ at all points $p \in \{0,1\}^{s_{i+1}}$. Hence, it must be the
unique multilinear extension of $V_{i+1}$.
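
As a concrete reading of Equation (8), the following Python sketch evaluates the multilinear extension of a vector of gate values directly from the sum over Boolean vectors. The modulus `P`, the names `chi` and `mle`, and the bit ordering (first coordinate is the most significant index bit) are assumptions of this sketch; the direct sum is deliberately the slow baseline that the rest of this subsection improves upon.

```python
P = 2**61 - 1  # assumed field modulus


def chi(bit, x):
    """chi_0(x) = 1 - x and chi_1(x) = x, reduced mod P."""
    return x % P if bit else (1 - x) % P


def mle(values, x):
    """Evaluate the multilinear extension of `values` at x via Equation (8).

    `values` has length 2**len(x); index bits are read most significant first.
    Runs in O(s * 2^s) time for s = len(x).
    """
    s = len(x)
    assert len(values) == 2**s
    total = 0
    for idx, val in enumerate(values):
        weight = 1
        for k in range(s):
            bit = (idx >> (s - 1 - k)) & 1
            weight = weight * chi(bit, x[k]) % P
        total = (total + val * weight) % P
    return total
```

On Boolean inputs exactly one weight is 1 and the rest are 0, so `mle` returns the stored gate value; at other field points it interpolates linearly in each coordinate.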

The intuition behind our optimizations is the following. In round $j$ of the sum-check protocol, there
are $|S^{(j)}|$ points at which $V_{i+1}$ must be evaluated. Equation (8) can be exploited to show that each gate at
layer $i+1$ of the circuit contributes to $V_{i+1}(p)$ for at most one point $p \in S^{(j)}$; namely the point $p$ whose last
$s_{i+1} - j$ coordinates agree with those of the gate's label. This observation alone is enough to achieve an $O(S_{i+1} \log S_i)$
runtime for $\mathcal{P}$ in total across all iterations of the sum-check protocol, because there are $S_{i+1}$ gates at layer
$i+1$, and only $s_i = \log S_i$ rounds of the sum-check protocol. However, we need to go further in order to
shave off the last $\log S_i$ factor from $\mathcal{P}$'s runtime. Essentially, what we do is group the gates at layer $i+1$ by
the point $p \in S^{(j)}$ to which they contribute. Each such group can be treated as a single unit, ensuring that
the work $\mathcal{P}$ has to do in any round of the sum-check protocol in order to evaluate $V_{i+1}$ at all points in $S^{(j)}$ is
proportional to $|S^{(j)}|$ rather than to $S_{i+1}$. Since the size of $S^{(j)}$ falls geometrically with $j$, our desired time
bounds follow.

Pre-processing. $\mathcal{P}$ will begin by computing an array $V^{(0)}$, which is simply defined to be the vector of gate
values at layer $i+1$, i.e. identifying a number $0 \le j < S_{i+1}$ with its binary representation in $\{0,1\}^{s_{i+1}}$, $\mathcal{P}$
sets $V^{(0)}[(j_1,\dots,j_{s_{i+1}})] = V_{i+1}(j_1,\dots,j_{s_{i+1}})$ for each $(j_1,\dots,j_{s_{i+1}}) \in \{0,1\}^{s_{i+1}}$. The right hand side of this
equation is simply the value of the $j$th gate at layer $i+1$ of $C$. So $\mathcal{P}$ can fill in the array $V^{(0)}$ when she
evaluates the circuit $C$, before receiving any messages from $\mathcal{V}$.

Overview of Online Processing. In round $j$ of the sum-check protocol, $\mathcal{P}$ needs to evaluate the polynomial
$V_{i+1}$ at the $O(2^{s_i-j})$ points in the set $S^{(j)}$. $\mathcal{P}$ will do this using the help of intermediate arrays $V^{(j)}$ defined as
follows.

Define $V^{(j)}$ to be the length $2^{s_{i+1}-j}$ array such that for $(p_{j+1},\dots,p_{s_{i+1}}) \in \{0,1\}^{s_{i+1}-j}$,

$$V^{(j)}[(p_{j+1},\dots,p_{s_{i+1}})] = \sum_{(b_1,\dots,b_j) \in \{0,1\}^j} V_{i+1}(b_1,\dots,b_j,p_{j+1},\dots,p_{s_{i+1}}) \cdot \prod_{k=1}^{j} \chi_{b_k}(r_k).$$

Efficiently Constructing $V^{(j)}$ Arrays. Inductively, assume $\mathcal{P}$ has computed in the previous round the array
$V^{(j-1)}$ of length $2^{s_{i+1}-j+1}$.

As the base case, we explained how $\mathcal{P}$ can fill in $V^{(0)}$ in the process of evaluating the circuit $C$. Now
observe that $\mathcal{P}$ can compute $V^{(j)}$ given $V^{(j-1)}$ in $O(2^{s_{i+1}-j})$ time using the following recurrence:

$$V^{(j)}[(p_{j+1},\dots,p_{s_{i+1}})] = V^{(j-1)}[(0,p_{j+1},\dots,p_{s_{i+1}})] \cdot \chi_0(r_j) + V^{(j-1)}[(1,p_{j+1},\dots,p_{s_{i+1}})] \cdot \chi_1(r_j).$$

Thus, at the end of round $j$ of the sum-check protocol, when $\mathcal{V}$ sends $\mathcal{P}$ the value $r_j$, $\mathcal{P}$ can compute
$V^{(j)}$ from $V^{(j-1)}$ in $O(2^{s_{i+1}-j+1})$ time.

Using the $V^{(j)}$ Arrays. We now show how to use the array $V^{(j-1)}$ to evaluate $V_{i+1}(p)$ in constant time for
any point of the form $p = (r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_{i+1}})$ with $(b_{j+1},\dots,b_{s_{i+1}}) \in \{0,1\}^{s_{i+1}-j}$. We exploit the
following sequence of equalities:

$$
\begin{aligned}
V_{i+1}(r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_{i+1}})
&= \sum_{c \in \{0,1\}^{s_{i+1}}} V_{i+1}(c)\,\chi_c(r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_{i+1}}) \\
&= \sum_{(c_1,\dots,c_j) \in \{0,1\}^j}\ \sum_{(c_{j+1},\dots,c_{s_{i+1}}) \in \{0,1\}^{s_{i+1}-j}} V_{i+1}(c)\,\chi_c(r_1,\dots,r_{j-1},t,b_{j+1},\dots,b_{s_{i+1}}) \\
&= \sum_{(c_1,\dots,c_j) \in \{0,1\}^j}\ \sum_{(c_{j+1},\dots,c_{s_{i+1}}) \in \{0,1\}^{s_{i+1}-j}} V_{i+1}(c) \left(\prod_{k=1}^{j-1} \chi_{c_k}(r_k)\right) \chi_{c_j}(t) \left(\prod_{k=j+1}^{s_{i+1}} \chi_{c_k}(b_k)\right) \\
&= \sum_{(c_1,\dots,c_j) \in \{0,1\}^j} V_{i+1}(c_1,\dots,c_j,b_{j+1},\dots,b_{s_{i+1}}) \left(\prod_{k=1}^{j-1} \chi_{c_k}(r_k)\right) \chi_{c_j}(t) \\
&= V^{(j-1)}[(0,b_{j+1},\dots,b_{s_{i+1}})] \cdot \chi_0(t) + V^{(j-1)}[(1,b_{j+1},\dots,b_{s_{i+1}})] \cdot \chi_1(t).
\end{aligned}
$$

Here, the first equality holds by Equation (8). The third holds by definition of the function $\chi_c$. The
fourth holds because for Boolean values $b_k, c_k \in \{0,1\}$, $\chi_{c_k}(b_k) = 1$ if $c_k = b_k$, and $\chi_{c_k}(b_k) = 0$ otherwise.
The final equality holds by definition of the array $V^{(j-1)}$.
Putting Things Together. In round $j$ of the sum-check protocol, $\mathcal{P}$ uses the array $V^{(j-1)}$ to evaluate $V_{i+1}(p)$
for all $O(2^{s_i-j})$ points $p \in S^{(j)}$. This requires constant time per point, and hence $O(2^{s_i-j})$ time across all
points. At the end of round $j$, $\mathcal{V}$ sends $\mathcal{P}$ the value $r_j$, and $\mathcal{P}$ computes $V^{(j)}$ from $V^{(j-1)}$ in $O(2^{s_{i+1}-j})$ time.
In total across all rounds of the sum-check protocol, $\mathcal{P}$ spends $O\left(\sum_{j=1}^{s_i} 2^{s_i-j} + 2^{s_{i+1}-j}\right) = O(2^{s_i} + 2^{s_{i+1}})$ time
to evaluate $V_{i+1}$ at the relevant points. When combined with our $O(2^{s_i})$-time algorithm for computing all
the relevant $\beta(z,p)$ values, we see $\mathcal{P}$ takes $O(2^{s_i} + 2^{s_{i+1}}) = O(S_i + S_{i+1})$ time to run the entire sum-check
protocol for iteration $i$ of our circuit-checking protocol.
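
The bookkeeping for the $V^{(j)}$ arrays can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the modulus `P`, the function names, and the most-significant-bit-first array indexing are assumptions, and `mle` is the slow definition from Equation (8), included only to check the fast path.

```python
P = 2**61 - 1  # assumed field modulus


def mle(values, x):
    """Slow reference: multilinear extension of `values` at x (Equation (8))."""
    s = len(x)
    total = 0
    for idx, val in enumerate(values):
        weight = 1
        for k in range(s):
            bit = (idx >> (s - 1 - k)) & 1
            weight = weight * (x[k] if bit else 1 - x[k]) % P
        total = (total + val * weight) % P
    return total


def fold_values(v_prev, rj):
    """V^(j) from V^(j-1): bind the leading free coordinate to rj."""
    half = len(v_prev) // 2
    return [(v_prev[idx] * (1 - rj) + v_prev[half + idx] * rj) % P
            for idx in range(half)]


def eval_point(v_prev, t, b_index):
    """V_{i+1}(r_1..r_{j-1}, t, b) in O(1) time from V^(j-1)."""
    half = len(v_prev) // 2
    return (v_prev[b_index] * (1 - t) + v_prev[half + b_index] * t) % P
```

In round $j$ the prover would call `eval_point` once per point of $S^{(j)}$ and then call `fold_values` with the verifier's challenge $r_j$. Each fold halves the array, which is what makes the total cost over all rounds a geometric sum rather than a sum of $s_i$ equal terms.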

5.5 A General Theorem 

In this section we formalize a large class of circuits for which our refinements yield asymptotic savings
relative to prior implementations of the GKR protocol. Our protocol makes use of the following functions
that capture the wiring structure of an arithmetic circuit $C$.

Definition 2 Let $C$ be a layered arithmetic circuit of depth $d(n)$ and size $S(n)$ over finite field $\mathbb{F}$. For every
$i \in \{1,\dots,d-1\}$, let $\mathrm{in}_1^{(i)} : \{0,1\}^{s_i} \to \{0,1\}^{s_{i+1}}$ and $\mathrm{in}_2^{(i)} : \{0,1\}^{s_i} \to \{0,1\}^{s_{i+1}}$ denote the functions that take
as input the binary label $p$ of a gate at layer $i$ of $C$, and output the binary label of the first and second
in-neighbor of gate $p$ respectively. Similarly, let $\mathrm{type}^{(i)} : \{0,1\}^{s_i} \to \{0,1\}$ denote the function that takes as
input the binary label $p$ of a gate at layer $i$ of $C$, and outputs 0 if $p$ is an addition gate, and 1 if $p$ is a
multiplication gate.

In the following definition, one should think of v and v' as no larger than the number of bits in a constant 
number of machine words. 

Definition 3 Let $f$ be a function mapping $\{0,1\}^v$ to $\{0,1\}^{v'}$. Number the $v$ input bits from 1 to $v$, and the
$v'$ output bits from 1 to $v'$. We say that $f$ is regular if $f$ can be evaluated on any input in constant time, and
there is a subset of input bits $S \subseteq [v]$ with $|S| = O(1)$ such that:

1. Each input bit in $[v] \setminus S$ affects $O(1)$ of the output bits of $f$. Moreover, given input $j \in [v] \setminus S$, the set
$S_j$ of output bits affected by $x_j$ can be enumerated in constant time.

2. Each output bit of $f$ depends on at most one input bit.

Our protocol applied to $C$ proceeds in $d(n)$ iterations, where iteration $i$ consists of an application of the
sum-check protocol to an appropriate polynomial derived from $\mathrm{type}^{(i)}$, $\mathrm{in}_1^{(i)}$, and $\mathrm{in}_2^{(i)}$, followed by a phase
for "reducing to verification of a single point". For any layer $i$ of $C$ such that $\mathrm{in}_1^{(i)}$, $\mathrm{in}_2^{(i)}$, and $\mathrm{type}^{(i)}$ are all
regular, we can show that $\mathcal{P}$ can execute the sum-check protocol at iteration $i$ in $O(S_i + S_{i+1})$ time. To ensure
that $\mathcal{P}$ can execute the "reducing to verification of a single point" phase in $O(S_{i+1})$ time, we need to place
one additional condition on $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$.

Definition 4 We say that $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$ are similar if there is a set of output bits $T \subseteq [s_{i+1}]$ with $|T| = O(1)$
such that for all inputs $x$, the $j$th output bit of $\mathrm{in}_1^{(i)}$ equals the $j$th output bit of $\mathrm{in}_2^{(i)}$ for all $j \in [s_{i+1}] \setminus T$.
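
To make Definitions 3 and 4 concrete, the following Python sketch brute-forces both conditions for a binary-tree wiring pattern. It assumes, for illustration only, that gate $p$ has in-neighbors with labels $(p,0)$ and $(p,1)$; the function names are hypothetical, bits are numbered from 0 rather than from 1, and the exhaustive enumeration is a checking aid for tiny label lengths, not part of the protocol.

```python
from itertools import product


def in1(p):
    """Assumed first in-neighbor of gate p in a binary tree: label (p, 0)."""
    return p + (0,)


def in2(p):
    """Assumed second in-neighbor of gate p in a binary tree: label (p, 1)."""
    return p + (1,)


def affected_outputs(f, v):
    """For each input bit j, the output bits that can change when j flips."""
    affected = [set() for _ in range(v)]
    for x in product((0, 1), repeat=v):
        fx = f(x)
        for j in range(v):
            y = x[:j] + (1 - x[j],) + x[j + 1:]
            fy = f(y)
            affected[j] |= {o for o in range(len(fx)) if fx[o] != fy[o]}
    return affected


def differing_outputs(f, g, v):
    """Output positions where f and g disagree on some input (the set T)."""
    diff = set()
    for x in product((0, 1), repeat=v):
        fx, gx = f(x), g(x)
        diff |= {o for o in range(len(fx)) if fx[o] != gx[o]}
    return diff
```

For this wiring, every input bit affects exactly one output bit and the affected sets are disjoint, so both conditions of Definition 3 hold with $S$ empty; the two functions disagree only in the last output bit, so they are similar with $|T| = 1$.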

We are finally in a position to state the class of circuits to which our refinements apply. 

Theorem 1 Let $C$ be an arithmetic circuit, and suppose that for all layers $i$ of $C$, $\mathrm{in}_1^{(i)}$, $\mathrm{in}_2^{(i)}$, and $\mathrm{type}^{(i)}$
are regular. Suppose moreover that $\mathrm{in}_1^{(i)}$ is similar to $\mathrm{in}_2^{(i)}$ for all but $O(1)$ layers $i$ of $C$. Then there is a
valid interactive proof protocol $(\mathcal{P},\mathcal{V})$ for the function computed by $C$, with the following costs. The total
communication cost is $|O| + O(d(n) \log S(n))$ field elements, where $|O|$ is the number of outputs of $C$. The
time cost to $\mathcal{V}$ is $O(n \log n + d(n) \log S(n))$, and $\mathcal{V}$ can make a single streaming pass over the input, storing
$O(\log(S(n)))$ field elements. The time cost to $\mathcal{P}$ is $O(S(n))$.

The asymptotic costs of the protocol whose existence is guaranteed by Theorem 1 are identical to those
of the implementation of the GKR protocol due to Cormode et al. in [14], except that in Theorem 1, $\mathcal{P}$ runs
in time $O(S(n))$ rather than $O(S(n) \log S(n))$ as achieved by [14]. We defer the proof to Appendix A.

5.5.1 Applications 

Theorem 1 applies to circuits computing functions from a wide range of applications, with the following
implications.

MATMULT. Consider the following circuit $C$ of size $O(n^3)$ for multiplying two $n \times n$ matrices $A$ and $B$.
Let the input gate labelled $(0,i,j)$ correspond to $A_{ij}$, and the input labelled $(1,i,j)$ correspond to $B_{ij}$. The
layer of $C$ adjacent to the input consists of $n^3$ gates, where the gate labeled $(i,j,k) \in (\{0,1\}^{\log n})^3$ computes
$A_{ik} \cdot B_{kj}$. All subsequent layers constitute a binary tree of addition gates summing up the results and thereby
computing $\sum_k A_{ik} B_{kj}$ for all $(i,j) \in [n] \times [n]$.

For layers $i \in \{1,\dots,\log n\}$ of this circuit, $\mathrm{in}_1^{(i)}$, $\mathrm{in}_2^{(i)}$, and $\mathrm{type}^{(i)}$ are all regular, and moreover $\mathrm{in}_1^{(i)}$ is
similar to $\mathrm{in}_2^{(i)}$ (see Section 5.3.1 for a careful treatment of this wiring pattern). The remaining layer of the
circuit, layer $i = \log n + 1$, is regular, though $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$ are not similar for this layer. We obtain the following
immediate corollary.

Corollary 1 There is a valid interactive proof protocol for $n \times n$ MATMULT with the following costs. The
total communication cost is $n^2 + O(d(n) \log n)$ field elements, where the $n^2$ term is required to specify the
answer. The time cost to $\mathcal{V}$ is $O(n^2 \log n)$, and $\mathcal{V}$ can make a single streaming pass over the input in time
$O(n^2 \log n)$ while storing $O(\log n)$ field elements. The time cost to $\mathcal{P}$ is $O(n^3)$.

We note that the costs of Corollary 1 are subsumed by our special-purpose matrix multiplication protocol
presented later in Theorem 3. We included Corollary 1 to demonstrate the applicability of Theorem 1.
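
The layered MATMULT circuit described above can be sketched in Python as a layer-by-layer evaluation. This is an illustration under stated assumptions: the function name, the flat list layout in which $k$ occupies the low-order position of the gate label $(i,j,k)$, and the requirement that $n$ be a power of two are choices of this sketch rather than details confirmed by the source.

```python
def matmult_circuit(A, B, q):
    """Evaluate the layered circuit for A * B over integers mod q.

    Assumes n = len(A) is a power of two. Gate (i, j, k) of the product
    layer is stored at flat index (i * n + j) * n + k.
    """
    n = len(A)
    # Layer adjacent to the input: n^3 multiplication gates.
    layer = [A[i][k] * B[k][j] % q
             for i in range(n) for j in range(n) for k in range(n)]
    # Binary tree of addition gates: gate p adds in-neighbors 2p and 2p + 1.
    while len(layer) > n * n:
        layer = [(layer[2 * p] + layer[2 * p + 1]) % q
                 for p in range(len(layer) // 2)]
    # The n^2 output gates are indexed by (i, j).
    return [layer[i * n:(i + 1) * n] for i in range(n)]
```

For example, `matmult_circuit([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2**61 - 1)` evaluates eight product gates followed by one level of addition gates. The addition tree has $\log n$ levels, and in each level gate $p$ reads gates $2p$ and $2p+1$, which is the regular wiring pattern that the theorem exploits.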

DISTINCT. Recall the circuit $C$ over field size $q = 2^k - 1$ described in Section 5.3.2 that takes a vector $a \in \mathbb{F}^n$
as input and outputs the number of non-zero entries of $a$. This circuit has $k+1$ relevant layers and consists
entirely of multiplication gates. For any layer $i \in [k-1]$, an even-numbered gate $p$ at layer $i$ has both
in-wires connected to gate $p$ at layer $i+1$, while an odd-numbered gate $p$ at layer $i$ has one in-wire connected
to gate $p$ at layer $i+1$ and another connected to gate $p-1$ (which has binary representation $(p_{-s_i},0)$, where
$p_{-s_i}$ denotes the binary representation of $p$ with the coordinate $p_{s_i}$ removed). For these layers, $\mathrm{in}_1^{(i)}$, $\mathrm{in}_2^{(i)}$,
and $\mathrm{type}^{(i)}$ are all regular, and $\mathrm{in}_1^{(i)}$ is similar to $\mathrm{in}_2^{(i)}$.

At layer $k$, an even-numbered gate $p$ has both in-wires connected to gate $p/2$ at layer $k+1$, while
an odd-numbered gate $p$ at layer $k$ has its unique in-wire connected to gate $(p-1)/2$ at layer $k+1$. In
the former case, both in-neighbors of gate $p$ have binary representation $p_{-s_i}$. In the latter case the unique
in-neighbor of gate $p$ has binary representation $p_{-s_i}$. It is therefore easily seen that $\mathrm{in}_1^{(k)}$, $\mathrm{in}_2^{(k)}$, and $\mathrm{type}^{(k)}$
are all regular, and $\mathrm{in}_1^{(k)}$ is similar to $\mathrm{in}_2^{(k)}$. Finally, at layer $k+1$, both in-wires for gate $p$ are connected to
gate $p$ at layer $k+2$. It is easily seen that $\mathrm{in}_1^{(k+1)}$, $\mathrm{in}_2^{(k+1)}$, and $\mathrm{type}^{(k+1)}$ are all regular, and $\mathrm{in}_1^{(k+1)}$ is similar
to $\mathrm{in}_2^{(k+1)}$. With all layers of $C$ satisfying the requirements of Theorem 1, we obtain the following corollary.
Corollary 2 Let $q > \max\{m,n\}$ be a Mersenne prime. There is a valid interactive proof protocol over the
field $\mathbb{F}_q$ for DISTINCT with the following costs. The total communication cost is $O(\log n \log q)$ field elements.
The time cost to $\mathcal{V}$ is $O(m \log n)$, and $\mathcal{V}$ can make a single streaming pass over the input, storing $O(\log(n))$
field elements. The time cost to $\mathcal{P}$ is $O(n \log q)$.

To our knowledge, Corollary 2 yields the fastest known prover of any streaming interactive proof protocol
for DISTINCT that also has total communication and space usage for $\mathcal{V}$ that is sublinear in both $m$ and $n$.
The fastest result previously was the $O(n \cdot \log(n) \cdot \log(p))$-time prover obtained by the implementation of
Cormode et al. [14]. We remark however that for a data stream with $F_0$ distinct items, the prover in [14]
actually can be made to run in time $O(n + F_0 \cdot \log(n) \cdot \log(p))$, where the $O(n)$ term is due to the time
required to simply observe the entire input stream. Therefore, for streams where $F_0 = o(n/\log n)$, the
implementation of [14] achieves an asymptotically faster prover than implied by Corollary 2.
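
The arithmetic behind the DISTINCT circuit can be sketched in Python: by Fermat's Little Theorem, $x^{q-1}$ is 1 for non-zero $x$ and 0 for $x = 0$, and for $q = 2^k - 1$ the exponent $q - 1 = 2 + 4 + \dots + 2^{k-1}$ is reached using only multiplications. The function names and the squaring-and-accumulating schedule are assumptions of this sketch; it loosely mirrors the all-multiplication layers but does not reproduce the exact gate layout. It also assumes $2^k - 1$ is prime, $k \ge 2$, and that the count of non-zero entries is smaller than $q$.

```python
def power_q_minus_1(x, k):
    """Compute x^(q-1) mod q for q = 2^k - 1 using only multiplications."""
    q = 2**k - 1
    x %= q
    square = x * x % q        # x^2
    acc = square              # running product x^(2 + 4 + ... + 2^j)
    for _ in range(k - 2):
        square = square * square % q   # next power x^(2^(j+1))
        acc = acc * square % q
    return acc


def count_nonzero(a, k):
    """Number of non-zero entries of a, computed as sum_i a_i^(q-1) mod q."""
    q = 2**k - 1
    return sum(power_q_minus_1(x, k) for x in a) % q
```

Each entry costs about $2k$ field multiplications, which matches the $n \log q$ scaling of the circuit size for a vector of length $n$.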



Remark 7 Cormode et al. in [14, Section 3.2] describe how to extend the GKR protocol to handle circuits
with gates that compute more general operations than just addition and multiplication. At a high level, [14]
shows that gates computing any "low-degree" operation can be handled, and they demonstrate analytically
and experimentally that these more general gates can achieve cost savings for the DISTINCT problem. These
same optimizations are also applicable in conjunction with our refinements. We omit further details for
brevity, and did not implement these optimizations in conjunction with our refinements.

Other Problems. In order to demonstrate its generality, we describe two other non-trivial applications of
Theorem 1.

• Pattern Matching. In the Pattern Matching problem, the input consists of a stream of text $T =
(t_0,\dots,t_{n-1}) \in [n]^n$ and pattern $P = (p_0,\dots,p_{m-1}) \in [n]^m$. The pattern $P$ is said to occur at
location $i$ in $T$ if, for every position $k$ in $P$, $p_k = t_{i+k}$. The pattern-matching problem is to determine the
number of locations at which $P$ occurs in $T$. For example, one might want to determine the number
of times a given phrase appears in a corpus of emails stored in the cloud.

Cormode et al. describe the following circuit $C$ for Pattern Matching over the finite field $\mathbb{F}_q$. The
circuit first computes the quantity $I_i = \sum_{j=0}^{m-1} (t_{i+j} - p_j)^2$ for each $i \in [[n]]$, and then exploits Fermat's
Little Theorem (FLT) by computing $M = \sum_{i=1}^{n} I_i^{q-1}$. The number of occurrences of the pattern equals
$n - m - M$.

Computing $I_i$ for each $i$ can be done in $\log m + 2$ layers: the layer closest to the input computes
$t_{i+k} - p_k$ for each pair $(i,k) \in [[n]] \times [[m]]$, the next layer squares each of the results, and the circuit
then sums the results via a depth-$\log m$ binary tree of addition gates. The total size of the circuit $C$ is
$O(nm + n \log q)$, where the $nm$ term is due to the computation of the $I_i$ values, and the $n \log q$ term is
due to the FLT computation. The total depth of the circuit is $O(\log m + \log q) = O(\log q)$.

We have already demonstrated that Theorem 1 applies to the squaring layer, the binary tree sub-circuit,
and the FLT computation. The only remaining layer of the circuit is the one that computes $t_{i+k} - p_k$
for each pair $(i,k) \in [[n]] \times [[m]]$. Unfortunately, Theorem 1 does not apply to this layer of the circuit.
This is because the first in-neighbor of a gate with label $(i_1,\dots,i_{\log n},k_1,\dots,k_{\log m}) \in \{0,1\}^{\log n + \log m}$
has label equal to the binary representation of the integer $i + k$, and a single bit $i_j$ can affect many bits
in the binary representation of $i + k$ (likewise, each bit in the binary representation of $i + k$ may be
affected by many bits in the binary representation of $i$ and $k$).
However, in Appendix B we describe how to extend the ideas underlying Theorem 1 to handle this
wiring pattern. The extensions in Appendix B may be more broadly useful, as the wiring pattern
analyzed there is an instance of a common paradigm, in that it interprets binary gate labels as a pair
of integers and performs a simple arithmetic operation (namely addition) on those integers.

We also remark that, instead of going through the analysis of Appendix B, a more straightforward
approach is to simply apply the implementation of [14] to this layer; the runtime for $\mathcal{P}$ in the
corresponding sum-check protocol is $O(nm \log n)$. This does not affect the asymptotic costs of the protocol
if $m$ is constant, since in this case $nm \log n = O(n \log q)$, and the total runtime of $\mathcal{P}$ over all other layers
of the circuit is $\Theta(n \log q)$.

This analysis highlights the following point: our refinements can be applied to a circuit on a layer-by- 
layer basis, so they can still yield speedups even if some but not all layers of a circuit are sufficiently 
"regular" for our refinements to apply. 

A similar analysis applies to a closely related circuit that solves a more general problem known as 
Pattern Matching with Wildcards. We omit these details for brevity. 



• Fast Fourier Transform. Cormode et al. [14] also describe a circuit over $\mathbb{C}$ for computing the
standard radix-two decimation-in-time FFT. At a high level, this circuit works as follows. It proceeds
in $\log n$ stages, where for $k = (k_1,\dots,k_n) \in \{0,1\}^n$, the $k$th output of stage $i$ is recursively defined
as $V_i(k_1,\dots,k_n) = V_{i-1}(k_1,\dots,k_{i-1},0,k_{i+1},\dots,k_n) + \omega \cdot V_{i-1}(k_1,\dots,k_{i-1},1,k_{i+1},\dots,k_n)$, where $\omega$
denotes the complex root of unity (twiddle factor) associated with stage $i$ and output $k$. Theorem 1 is
easily seen to apply to the natural circuit executing this recurrence, and our refinements would
therefore shave a logarithmic factor off the runtime of $\mathcal{P}$ applied to this circuit, relative to the
implementation of [14] (since this circuit is defined over the infinite field $\mathbb{C}$, the protocol is only defined in a
model where complex numbers can be communicated and operated on at unit cost).
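
Returning to the Pattern Matching circuit described in the first item above, its arithmetic can be sketched in Python as follows. This is a hedged illustration, not the circuit itself: the function name is hypothetical, the candidate locations are taken to be $i = 0,\dots,n-m$ (an indexing convention assumed for this sketch, which may differ from the source's), and $q$ is assumed to be a prime larger than every $I_i$ and than the number of locations, so that a non-zero integer $I_i$ stays non-zero modulo $q$.

```python
def count_occurrences(text, pattern, q):
    """Count occurrences of `pattern` in `text` via Fermat's Little Theorem.

    I_i = sum_j (t_{i+j} - p_j)^2 is zero exactly when the pattern occurs at
    location i, and I_i^(q-1) mod q is 0 for I_i = 0 and 1 otherwise.
    """
    n, m = len(text), len(pattern)
    locations = n - m + 1          # assumed range of candidate locations
    mismatches = 0                 # plays the role of M in the text
    for i in range(locations):
        dist = sum((text[i + j] - pattern[j]) ** 2 for j in range(m)) % q
        mismatches = (mismatches + pow(dist, q - 1, q)) % q
    return (locations - mismatches) % q
```

The inner sum corresponds to the subtraction layer, the squaring layer, and the addition tree, while the call to `pow` stands in for the FLT layers.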

6 Experimental Results 

We implemented the protocols implied by Theorem 1 as applied to circuits computing MATMULT and
DISTINCT. These experiments serve as case studies to demonstrate the feasibility of Theorem 1 in practice,
and to quantify the improvements over prior implementations. While Section 8 describes a specialized
protocol for MATMULT that is significantly more efficient than the protocol implied by Theorem 1,
MATMULT serves as an important case study for the costs of the more general protocol described in Theorem 1,
and allows for direct comparison with prior implementation work that also evaluated general-purpose
protocols via their performance on the MATMULT problem [14, 35, 36, 38, 40].

Our comparison point is the implementation of Cormode et al. [14], with some of the refinements of Vu
et al. [40] included. In particular, our comparison point for matrix multiplication uses the refinement of [40]
for circuits with multiple outputs described in Section 4.3.2. We did not include Vu et al.'s optimization
from Lemma 3 that reduced the runtime of $\mathcal{V}$ from $O(n \log n)$ to $O(n)$, because this optimization blows up
the space usage of $\mathcal{V}$ to $\Omega(n)$, while we want to use a smaller-space verifier for streaming applications such
as DISTINCT.

6.1 Summary of Results 

The main takeaways of our experiments are as follows. When Theorem 1 is applicable, the prover in the
resulting protocol is 200x-250x faster than the previous state of the art implementation of the GKR protocol.
The communication costs and the number of rounds required by our protocols are also 2x-3x smaller than
the previous state of the art. The verifier in our implementation takes essentially the same amount of time
as in prior implementations of the GKR protocol; this time is much smaller than the time to perform the
computation locally without a prover.

Most of the observed 200x speedup can be directly attributed to our improvements in protocol design
over prior work: the circuit for 512 x 512 matrix multiplication is of size $2^{28}$, and hence our $\log S$ factor
improvement in the runtime of $\mathcal{P}$ likely accounts for at least a 28x speedup. The 3x reduction in the number of
rounds accounts for another 3x speedup. The remaining speedup factor of roughly 2x may be due to a more
streamlined implementation relative to prior work, rather than improved protocol design per se.

We have both a serial implementation and a parallel implementation that leverages graphics processing
units (GPUs). The prover in our parallel implementation runs roughly 30x faster than the prover in our serial
implementation. The ability to leverage GPUs to obtain robust speedups in our setting is not unexpected, as
Thaler, Roberts, Mitzenmacher, and Pfister demonstrated substantial speedups for an earlier implementation
of the GKR protocol using GPUs in [38].

All of our code is available online at [39]. All of our serial code was written in C++ and all experiments
were compiled with g++ using the -O3 compiler optimization flag and run on a workstation with a 64-bit
Intel Xeon architecture and 48 GBs of RAM. We implemented all of our GPU code in CUDA and Thrust [24]
with all compiler optimizations turned on, and ran our GPU implementation on an NVIDIA Tesla C2070
GPU with 6 GBs of device memory.

6.2 Details 

Choice of Finite Field. All of our circuits work over the finite field of size $q = 2^{61} - 1$. Several remarks
are appropriate regarding our choice of field size. This field was used in our earlier work [14] because
it supports fast arithmetic, as reducing an integer modulo $q$ can be done with a bit-shift, addition, and a
bit-wise AND. (The same observation applies to any field whose size equals a Mersenne prime, including
$2^{89} - 1$, $2^{107} - 1$, and $2^{127} - 1$.) Moreover, the field is large enough that the probability a verifier is fooled by
a dishonest prover is smaller than 1/2 for all of the problems we consider (this probability is proportional
to $\frac{d(n) \log S(n)}{q}$).

The main potential issue with our choice of field size is that "overflow" can occur for problems like
matrix multiplication if the entries of the input matrices can be very large. For example, with 512 x 512
matrix multiplication, if the entries of the input matrices $A, B$ are larger than $2^{26}$, an entry in the product
matrix $AB$ can be as large as $2^{61}$, which is larger than our field size. If this is a concern, a larger field size is
appropriate. (Notice that for a problem like DISTINCT, there is no danger of overflow issues as long as the
length of the stream is smaller than $2^{61} - 2$, which is larger than any stream encountered in practice.)

A second reason to use larger field sizes is to handle floating-point or rational arithmetic as proposed by
Setty et al. in (55).

All of our protocols can be instantiated over fields with more than $q = 2^{61} - 1$ elements, with an
implementation using these fields experiencing a slowdown proportional to the increased cost of arithmetic over
these fields.
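
The fast reduction modulo $q = 2^{61} - 1$ mentioned above can be sketched in Python. This is a minimal sketch rather than the authors' C++ routine: the names `Q`, `reduce_mersenne`, and `mul` are hypothetical, and the version here folds twice and ends with one comparison so that it is correct for any product of two reduced field elements.

```python
Q = (1 << 61) - 1  # the Mersenne prime 2^61 - 1


def reduce_mersenne(x):
    """Reduce 0 <= x < 2^122 modulo Q using shifts, additions and ANDs.

    Since 2^61 = 1 (mod Q), the high bits can simply be added to the low bits.
    """
    x = (x & Q) + (x >> 61)   # first fold: result is below 2^62
    x = (x & Q) + (x >> 61)   # second fold absorbs the possible carry
    return 0 if x == Q else x


def mul(a, b):
    """Multiply two field elements already reduced to the range [0, Q)."""
    return reduce_mersenne(a * b)
```

The overflow example from the text can be checked the same way: `512 * (2**26) ** 2` equals `2**61`, which exceeds `Q`, so a product-matrix entry of that size would wrap around in the field.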

6.2.1 Serial Implementation 

MATMULT. The costs of our serial MATMULT implementation are displayed in Table 1. The prover in our
matrix multiplication implementation is about 250x faster than the previous state of the art. For example,
when multiplying two 512 x 512 matrices, our prover takes about 38 seconds, while our comparison
implementation takes over 2.5 hours. A C++ program that simply evaluates the circuit without an integrity
guarantee takes 6.07 seconds, so our prover experiences less than a 7x slowdown in order to evaluate the
circuit with an integrity guarantee relative to simply evaluating the circuit without such a guarantee.

| Implementation | Problem Size | $\mathcal{P}$ Time | $\mathcal{V}$ Time | Rounds | Total Communication | Circuit Eval Time |
|---|---|---|---|---|---|---|
| Previous state of the art | 256 x 256 | 1054.35 s | 0.02 s | 623 | 14.6 KBs | 0.73 s |
| Theorem 1 | 256 x 256 | 4.37 s | 0.02 s | 190 | 4.4 KBs | 0.73 s |
| Previous state of the art | 512 x 512 | 9759.18 s | 0.10 s | 767 | 17.97 KBs | 6.07 s |
| Theorem 1 | 512 x 512 | 37.85 s | 0.10 s | 236 | 5.48 KBs | 6.07 s |

Table 1: Experimental results for $n \times n$ MATMULT with our serial implementation. The Total Communication
column does not count the communication required to specify the answer, only the "extra" communication
required to run the verification protocol.

When multiplying two 512 x 512 matrices $A$ and $B$, our protocol requires 236 rounds, and its total
communication cost is 5.48 KBs (plus the amount of communication required to specify the answer $AB$). The
previous state of the art required 767 rounds and close to 18 KBs of communication (plus the amount of
communication required to specify $AB$). Notice that specifying a 512 x 512 matrix using 8 bytes per entry
requires 2 MBs, which is more than 300 times larger than the 5.48 KBs of extra communication required to
verify the answer.

A serial C++ program performing 512 x 512 matrix multiplication over the integers with floating point
arithmetic (without going through the circuit representation of the computation) required 1.53 seconds, so
our prover runs approximately 25 times slower than a standard unverifiable matrix multiplication algorithm.
A serial C++ program performing the same multiplication over the finite field of size $2^{61} - 1$ required 4.74
seconds, so our serial prover runs about 8 times slower than an unverifiable matrix multiplication algorithm
over the corresponding finite field.
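
The ratios quoted for the serial MATMULT experiments can be recomputed from the reported timings with a few lines of Python. The variable names are hypothetical; the numbers are the prover, circuit-evaluation, and unverified-multiplication times reported in this section.

```python
# Reported prover times in seconds, keyed by matrix dimension n.
prev_prover = {256: 1054.35, 512: 9759.18}
our_prover = {256: 4.37, 512: 37.85}
circuit_eval = {256: 0.73, 512: 6.07}

# Speedup of the prover over the comparison implementation.
speedup = {n: prev_prover[n] / our_prover[n] for n in our_prover}

# Slowdown of the prover relative to plain circuit evaluation.
overhead = {n: our_prover[n] / circuit_eval[n] for n in our_prover}

# Slowdown relative to unverified 512 x 512 multiplication.
vs_floating_point = our_prover[512] / 1.53
vs_finite_field = our_prover[512] / 4.74
```

With these inputs the speedups come out near 241x and 258x, the overhead relative to circuit evaluation is a little over 6x, and the slowdowns relative to unverified multiplication are close to 25x and 8x.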

Our verifier takes essentially the same amount of time as in prior work, as in both implementations the
bulk of the work of the verifier is spent evaluating the low-degree extension of the input at a point. This is
more than an order of magnitude faster than the 1.53 seconds required by a serial C++ program performing
the multiplication in an unverified manner over the integers, so the verifier is indeed saving time by using a
prover (relative to doing the computation locally without a prover). We stress that the savings for the verifier
would be larger at larger input sizes, as the time cost to the verifier in our implementation and the prior
implementation of [14] is quasilinear in the input size, which is polynomially faster than all known matrix
multiplication algorithms. Moreover, when streaming considerations are not an issue, we could apply the
refinement of Vu et al. from Lemma 3 to reduce $\mathcal{V}$'s runtime from $O(n^2 \log n)$ to $O(n^2)$ and thereby further
speed up the verifier.

DISTINCT. The costs of our serial DISTINCT implementation are displayed in Table 2. The comparison of
our implementation with prior work is similar to the case of matrix multiplication. Our prover is roughly 200
times faster than the comparison implementation. For example, when computing the number of non-zero
entries of a vector of length $2^{20}$, our prover takes about 17 seconds, while our comparison implementation
takes about 57 minutes. A C++ program that simply evaluates the circuit without an integrity guarantee
takes 1.88 seconds, so our prover experiences roughly a 10x slowdown in order to evaluate the circuit with
an integrity guarantee relative to simply evaluating the circuit. Our implementation required 1361 rounds
and 40.76 KBs of total communication, compared to 3916 rounds and 91.3 KBs for the previous state of the
art. This is essentially a 3x reduction in the number of rounds, and a 2.25x reduction in the total amount of
communication.

| Implementation | $\mathcal{P}$ Time | $\mathcal{V}$ Time | Rounds | Total Communication | Circuit Eval Time |
|---|---|---|---|---|---|
| Previous state of the art | 3400.23 s | 0.20 s | 3916 | 91.3 KBs | 1.88 s |
| Theorem 1 | 17.28 s | 0.20 s | 1361 | 40.76 KBs | 1.88 s |

Table 2: Experimental results for computing the number of non-zero entries of a vector of length $2^{20}$ with
our serial implementation.

A C++ program that (unverifiably) computes the number of non-zero entries in a vector x with $2^{20}$
entries takes less than 0.01 seconds, and our prover implementation runs more than 1,700 times longer than
this. The reason that the slowdown for the prover relative to an unverifiable algorithm is larger for DISTINCT
than for MATMULT is that DISTINCT is a "less arithmetic" problem, in the sense that the size of the arithmetic
circuit we use for computing DISTINCT is more than 100x larger than the runtime of an unverifiable serial
algorithm for the problem. We stress however that, as pointed out in [38], when solving the DISTINCT
problem in practice, an unverifiable algorithm would first aggregate a data stream into its frequency-vector
representation before determining the number of non-zero frequencies; in reporting a time bound of 0.01
seconds for unverifiably solving DISTINCT, we are not taking the aggregation time cost into account. For
sufficiently long data streams, the slowdown for our prover relative to an unverifiable algorithm would be
much smaller than 1,700x if we took aggregation time into account.

6.2.2 Parallel Implementation 

Our serial implementation demonstrates that P experiences a 10x slowdown in order to evaluate the circuit
with an integrity guarantee relative to simply evaluating the circuit without such a guarantee. The purpose
of this section is to demonstrate that parallelization can further mitigate this slowdown. To this end, we
implemented a parallel version of our prover in the context of the matrix multiplication protocol of Section
5. Our parallel implementation uses a graphics processing unit (GPU).

The high-level idea behind our parallel implementation is the following. Each time we apply the sum-check
protocol to a polynomial $g^{(i)}$, it suffices for P to evaluate $g^{(i)}$ at a large number of points $r$ of the form
$r = (r_1, \dots, r_{j-1}, t, b_{j+1}, \dots, b_{s_{i+1}})$ with $t \in \{0, \dots, \deg_j(g^{(i)})\}$ and $(b_{j+1}, \dots, b_{s_{i+1}}) \in \{0,1\}^{s_{i+1}-j}$. We can
perform each of these evaluations independently. Thus, we devote a single thread on the GPU to each value
of $(b_{j+1}, \dots, b_{s_{i+1}}) \in \{0,1\}^{s_{i+1}-j}$ and have that thread evaluate $g^{(i)}(r)$ at each of the $\deg_j(g^{(i)}) + 1$ points of
the form $(r_1, \dots, r_{j-1}, t, b_{j+1}, \dots, b_{s_{i+1}})$ with the help of the $C^{(j-1)}$ and $V^{(j-1)}$ arrays described in Section 5.
The one remaining issue is that after each round $j$ of each invocation of the sum-check protocol, we need to
update the arrays, i.e., we need to compute $C^{(j)}$ and $V^{(j)}$. To accomplish this, we devote a single thread to
each entry of $C^{(j)}$ and $V^{(j)}$.

All steps of our parallel implementation achieve excellent memory coalescing, which likely plays a
significant role in the large speedups we were able to achieve. For example, if two threads are updating
adjacent entries of the array $V^{(j)}$, the only memory accesses that the threads need to perform are to adjacent
entries of the array $V^{(j-1)}$.
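
The per-entry update can be sketched in a few lines of Python. This is a hedged illustration rather than the GPU code itself: the names `fold_entry`, `fold`, and `evaluate_by_folding` are hypothetical, the list comprehension stands in for launching one thread per entry, and the layout (binding the highest-order variable so that entry b reads cells b and b + half of the previous array) is an assumption chosen because it makes neighbouring threads read neighbouring cells.

```python
P = 2**61 - 1  # field size used in the experiments above

def fold_entry(prev, b, r):
    """Work done by the (hypothetical) thread responsible for entry b."""
    half = len(prev) // 2
    lo, hi = prev[b], prev[b + half]   # threads b and b+1 touch neighbouring cells
    return (lo + r * (hi - lo)) % P    # equals (1 - r) * lo + r * hi

def fold(prev, r):
    """Bind the top variable to r; every output entry is independent of the others."""
    return [fold_entry(prev, b, r) for b in range(len(prev) // 2)]

def evaluate_by_folding(table, point):
    """Bind variables from the highest-order one down; returns the MLE value at point."""
    for r in reversed(point):
        table = fold(table, r)
    return table[0]
```

Because each output entry depends on exactly two input entries and on no other output, the update parallelizes with no synchronization inside a round; only the hand-off from one round to the next is sequential.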

The results are shown in Table 3: we obtained about a 30x speedup for the prover relative to our serial
implementation. The reported prover runtime does count the time required to copy data between the host
(CPU) and the device (GPU), but does not count the time required to evaluate the circuit, which our
implementation does in serial for simplicity. While our implementation evaluates the circuit serially, this step can
in principle be done in parallel one layer at a time, as these circuits have only logarithmic depth. Notice that
when the circuit evaluation runtime is excluded, our parallel prover implementation runs faster in the case
of 512 x 512 matrix multiplication than the time required to evaluate the circuit sequentially.

| Implementation                     | Problem Size | P Time  | Serial Circuit Eval Time |
|------------------------------------|--------------|---------|--------------------------|
| Theorem 1, Serial Implementation   | 256 x 256    | 4.37 s  | 0.73 s                   |
| Theorem 1, Parallel Implementation | 256 x 256    | 0.23 s  | 0.73 s                   |
| Theorem 1, Serial Implementation   | 512 x 512    | 37.85 s | 6.07 s                   |
| Theorem 1, Parallel Implementation | 512 x 512    | 1.29 s  | 6.07 s                   |

Table 3: Experimental results for n x n MATMULT with our parallel prover implementation.

It is possible that we would observe slightly larger speedups at larger input sizes, but our parallel
implementation exhausts the memory of the GPU at inputs larger than 512 x 512. This memory bottleneck was
also experienced by Thaler, Roberts, Mitzenmacher, and Pfister [38], who used the GPU to obtain a parallel
implementation of the protocol of Cormode et al. [14], and helps motivate the importance of the improved
space usage of the special purpose MATMULT protocol we give later in Theorem 3. For comparison, the
GPU implementation of [38] required 39.6 seconds for 256 x 256 matrix multiplication, which is about
175x slower than our parallel implementation.

We also mention that Thaler, Roberts, Mitzenmacher, and Pfister [38] demonstrate that equally large
speedups via parallelization are achievable for the (already fast) computation of the verifier. These results
directly apply to our protocols as well, as the verifier's runtime in both implementations is dominated by the
time required to evaluate the MLE of the input at a random point [14, 38].



7 Verifying General Data Parallel Computations 



In this section, our goal is to extend the applicability of the GKR protocol. While the GKR protocol applies in
principle to any function computed by a small-depth circuit, this is not the case when fine-grained efficiency
considerations are taken into account. The implementation of Cormode et al. [14] required the programmer
to express a program as an arithmetic circuit, and moreover this circuit needed to have a regular wiring
pattern, in the sense that the verifier could efficiently evaluate the polynomials $\mathrm{add}_i$ and $\mathrm{mult}_i$ at a point. If
this was not the case, the verifier would need to do an expensive (though data-independent) preprocessing
phase to perform these evaluations. Moreover, even for circuits with regular wiring patterns, this
implementation caused the prover to suffer an $O(\log(S(n)))$ factor blowup in runtime relative to evaluating the circuit
without a guarantee of correctness. The results of Sections 5 and 8 asymptotically eliminate the blowup in
runtime for the prover, but they also only apply when the circuit has a very regular wiring pattern.

The implementation of Vu et al. [40] allows the programmer to express a program in a high-level
language, but compiles these programs into potentially irregular circuits that require the verifier to incur the
expensive preprocessing phase mentioned above, in order for the verifier to evaluate the polynomials $\mathrm{add}_i$
and $\mathrm{mult}_i$ at a point. They therefore propose to apply their system in a "batching" model, where multiple
instances of the same sub-computation are applied independently to different pieces of data. More specifically,
their system applies the GKR protocol independently to each application of the computation, and relies on
the ability of the verifier to use a single $\mathrm{add}_i$ and $\mathrm{mult}_i$ evaluation for all instances of the sub-computation,
thereby amortizing the cost of this evaluation across the instances. To clarify, this use of a single $\mathrm{add}_i$ and
$\mathrm{mult}_i$ evaluation for all instances is only sound if all of the instances are checked simultaneously. If the
instances are instead verified one after the other, then P knows V's randomness in all but the first instance,
and can use that knowledge to mislead V.

The batching model of Vu et al. is essentially identical to the data parallel setting we consider here.
However, a downside to the solution of Vu et al. is that the verifier's work grows linearly with the "batch
size", i.e., the number of applications of the sub-computation that are being outsourced. We wish to develop a
protocol whose costs to both the prover and verifier grow much more slowly with the batch size.

Figure 2: Schematic of a data parallel computation.

7.1 Motivation 

As discussed above, existing interactive proof protocols for circuit evaluation either apply only to circuits 
with highly regular wiring patterns, or incur large overheads for the prover and verifier. While we do not have 
a magic bullet for dealing with irregular wiring patterns, we do wish to mitigate the bottlenecks of existing
protocols by leveraging some general structure underlying many real-world computations. Specifically, the 
structure we focus on exploiting is data-parallelism. 

By data parallel computation, we mean any setting in which the same sub-computation is applied
independently to many pieces of data, before possibly aggregating the results. Crucially, we do not want to
make significant assumptions on the sub-computation that is being applied (in particular, we want to handle
sub-computations computed by circuits with highly irregular wiring patterns), but we are willing to assume
that the sub-computation is applied independently to many pieces of data. See Figure 2 for a schematic of a
data parallel computation.

We have already seen a very simple example of a data parallel computation: the DISTINCT problem. The
circuit C used earlier to solve this problem takes as input a vector a and computes $a_i^{q-1} \bmod q$ for
all i (this is the data parallel phase of the computation), before summing the results (this is the aggregation
phase). Notice that if the data stream consists of a sequence of words, then the DISTINCT problem becomes
the word-count problem, a classic data parallel application.
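
The two phases can be written out directly. The sketch below is illustrative only: the function names are hypothetical, the prime is assumed to be the $2^{61} - 1$ field size from the experiments, and the computation is performed in the clear rather than through a circuit. By Fermat's little theorem, $a_i^{q-1} \bmod q$ is 1 when $a_i$ is non-zero modulo q and 0 otherwise, so the sum counts the non-zero entries as long as every frequency is smaller than q.

```python
Q = 2**61 - 1  # a prime; assumed here to play the role of q

def frequency_vector(stream, n):
    """Aggregate a stream of items from {0, ..., n-1} into its frequency vector."""
    freq = [0] * n
    for item in stream:
        freq[item] += 1
    return freq

def count_nonzero(a, q=Q):
    # Data parallel phase: a_i^(q-1) mod q is 1 if a_i is non-zero mod q, else 0.
    indicators = [pow(ai, q - 1, q) for ai in a]
    # Aggregation phase: sum the indicators.
    return sum(indicators)
```

For example, `count_nonzero(frequency_vector(stream, n))` returns the number of distinct items appearing in `stream`.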

By design, the protocol of this section also applies to more complicated data parallel computations. 
For example, it applies to arbitrary counting queries on a database. In a counting query, one applies some 
function independently to each row of the database and sums the results. For example, one may ask "How 
many people in the database satisfy Property P?" Our protocol allows one to verifiably outsource such 
a counting query with overhead that depends minimally on the size of the database, but that necessarily 
depends on the complexity of the property P.

7.2 Overview of the Protocol

Let C be a circuit of size S(n) with an arbitrary wiring pattern, and let C* be a "super-circuit" that applies C 
independently to B different inputs before aggregating the results in some fashion. For example, in the case 
of a counting query, the aggregation phase simply sums the results of the data parallel phase. We assume that 
the aggregation step is sufficiently simple that the aggregation itself can be verified using existing techniques, 
and we focus on verifying the data parallel part of the computation. 

If one naively applied the basic GKR protocol to the super-circuit C*, V might have to perform an
expensive pre-processing phase to evaluate $\mathrm{add}_i$ and $\mathrm{mult}_i$ at the necessary locations; this would require time
$\Omega(B \cdot S)$. Moreover, when applying the basic GKR protocol to C*, P would require time $\Theta(B \cdot S \cdot \log(B \cdot S))$.

In order to improve on this, we observe that although each sub-computation C can have a very
complicated wiring pattern, the circuit is maximally regular between sub-computations, as the sub-computations
do not interact at all. Therefore, each time the basic GKR protocol would apply the sum-check protocol to a
polynomial derived from the wiring predicate of C*, we can instead use a simpler polynomial derived only
from the wiring predicate of C. By itself, this is enough to ensure that V's pre-processing phase requires
time only $O(S)$, rather than $O(B \cdot S)$ as in a naive application of the basic GKR protocol. That is, the cost of
V's pre-processing phase is essentially proportional to the cost of applying the GKR protocol only to C, not
to the super-circuit C*.

Furthermore, by combining this observation with the methods of Section 5, we can bring the runtime
of P down to $O(B \cdot S \cdot \log S)$. That is, the blowup in runtime suffered by the prover, relative to performing
the computation without a guarantee of correctness, is just a factor of $\log S$, the same as it would be if the
prover had run the basic GKR protocol on a single instance of the sub-computation.

7.3 Technical Details 

7.3.1 Notation 

Let C be an arithmetic circuit over $\mathbb{F}$ of depth d and size S with an arbitrary wiring pattern, and let C* be the
circuit of depth d and size $B \cdot S$ obtained by laying B copies of C side-by-side, where $B = 2^b$ is a power of
2. We assume that the in-neighbors of all of the $S_i$ gates at layer i can be enumerated in $O(S_i)$ time. We will
use the same notation as in Section 5, using *'s to denote quantities referring to C*. For example, layer i of
C has size $S_i = 2^{s_i}$ and gate values specified by the function $V_i$, while layer i of C* has size $S_i^* = 2^{s_i^*}$ and gate
values specified by the function $V_i^*$. We denote the length of the input to C* by $n^*$.

7.3.2 Main Theorem 

Theorem 2 For any point $z \in \mathbb{F}^{s_1^*}$, there is a valid interactive proof protocol for computing $\tilde{V}_1^*(z)$ with the
following costs. V spends $O(S)$ time in a pre-processing phase, and $O(n^* \log n^* + d \cdot \log(B \cdot S))$ time in an
online verification phase, where the $n^* \log n^*$ term is due to the time required to evaluate the multilinear
extension of the input to C* at a point. P runs in total time $O(S \cdot B \cdot \log S)$. The total communication is
$O(d \cdot \log(B \cdot S))$ field elements.

Proof: Consider layer i of C*. Let $p = (p_1, p_2) \in \{0,1\}^{s_i} \times \{0,1\}^b$ be the label of a gate at layer i of C*,
where $p_2$ specifies which "copy" of C the gate is in, while $p_1$ designates the label of the gate within the copy.
Similarly, let $\omega = (\omega_1, \omega_2) \in \{0,1\}^{s_{i+1}} \times \{0,1\}^b$ and $\gamma = (\gamma_1, \gamma_2) \in \{0,1\}^{s_{i+1}} \times \{0,1\}^b$ be the labels of two
gates at layer i + 1.

It is straightforward to check that for all $(p_1, p_2) \in \{0,1\}^{s_i} \times \{0,1\}^b$,

\[ V_i^*(p_1, p_2) = \sum_{\omega_1 \in \{0,1\}^{s_{i+1}}} \sum_{\gamma_1 \in \{0,1\}^{s_{i+1}}} g^{(i)}(p_1, p_2, \omega_1, \gamma_1), \]

where

\[ g^{(i)}(p_1, p_2, \omega_1, \gamma_1) = \beta_{s_i^*}(z, (p_1, p_2)) \cdot \Big( \mathrm{add}_i(p_1, \omega_1, \gamma_1) \big( \tilde{V}^*_{i+1}(\omega_1, p_2) + \tilde{V}^*_{i+1}(\gamma_1, p_2) \big) + \mathrm{mult}_i(p_1, \omega_1, \gamma_1) \big( \tilde{V}^*_{i+1}(\omega_1, p_2) \cdot \tilde{V}^*_{i+1}(\gamma_1, p_2) \big) \Big). \]

Essentially, this equation says that an addition (respectively, multiplication) gate $p = (p_1, p_2) \in \{0,1\}^{s_i + b}$
is connected to gates $\omega = (\omega_1, \omega_2) \in \{0,1\}^{s_{i+1} + b}$ and $\gamma = (\gamma_1, \gamma_2) \in \{0,1\}^{s_{i+1} + b}$ if and only if $p$, $\omega$, and $\gamma$ are
all in the same copy of C, and $p$ is connected to $\omega$ and $\gamma$ within the copy.

Lemma 4 then implies that for any $z \in \mathbb{F}^{s_i^*}$,

\[ \tilde{V}_i^*(z) = \sum_{(p_1, p_2, \omega_1, \gamma_1) \in \{0,1\}^{s_i} \times \{0,1\}^b \times \{0,1\}^{s_{i+1}} \times \{0,1\}^{s_{i+1}}} g^{(i)}(p_1, p_2, \omega_1, \gamma_1). \]

Thus, in iteration i of our protocol, we apply the sum-check protocol to the polynomial $g^{(i)}$. The
communication costs of this protocol are immediate.
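
The structure of this sum can be checked numerically on a toy instance. The sketch below is an illustration under assumptions, not part of the protocol: labels are represented as bit tuples, a layer of C is given as a hypothetical dictionary mapping each gate label $p_1$ to its operation and its two in-neighbors, and all names are invented for the example. It builds layer i of C* from B copies of the same wiring, then compares the multilinear extension of that layer at a random point z with the sum over $(p_1, p_2, \omega_1, \gamma_1)$ of the equality weight for $(z, (p_1, p_2))$ times the wiring predicates of C applied to the layer i + 1 values.

```python
import itertools
import random

P = 2**61 - 1  # field size used in the experiments above

def beta(z, bits):
    """Multilinear extension of the equality test between z and a Boolean label."""
    out = 1
    for zk, bk in zip(z, bits):
        out = out * (zk if bk else 1 - zk) % P
    return out

def mle(values, z):
    """Evaluate the multilinear extension of a table indexed by bit tuples."""
    return sum(beta(z, bits) * v for bits, v in values.items()) % P

def check_layer(gates, next_layer, s_next, b, z):
    """gates: p1 -> (op, omega1, gamma1), the wiring of ONE copy of C.
    next_layer: (label at layer i+1) + p2 -> value, for all B = 2^b copies.
    Returns both sides of the identity for layer i of C* at the point z."""
    copies = list(itertools.product((0, 1), repeat=b))
    labels_next = list(itertools.product((0, 1), repeat=s_next))
    # Layer i of C*: the same wiring is applied inside every copy p2.
    layer = {}
    for p1, (op, w1, g1) in gates.items():
        for p2 in copies:
            u, v = next_layer[w1 + p2], next_layer[g1 + p2]
            layer[p1 + p2] = (u + v) % P if op == "add" else u * v % P
    lhs = mle(layer, z)
    rhs = 0
    for p1, (op, w1, g1) in gates.items():
        for p2 in copies:
            for o1 in labels_next:
                for c1 in labels_next:
                    # The wiring predicates depend only on (p1, o1, c1), never on p2.
                    add = int(op == "add" and (o1, c1) == (w1, g1))
                    mult = int(op == "mul" and (o1, c1) == (w1, g1))
                    u, v = next_layer[o1 + p2], next_layer[c1 + p2]
                    term = add * (u + v) + mult * (u * v)
                    rhs = (rhs + beta(z, p1 + p2) * term) % P
    return lhs, rhs
```

The point of the exercise is the comment in the inner loop: the copy index $p_2$ enters only through the gate values and the equality weight, which is why the verifier's pre-processing depends on the size of C rather than the size of C*.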

Costs for V. In order to run her part of the sum-check protocol of iteration i, V only needs to perform the
required checks on each of P's messages. V's check requires $O(1)$ time in each round of the sum-check
protocol except the last. In the last round of the sum-check protocol, V must evaluate the polynomial $g^{(i)}$ at
a single point. This requires evaluating $\beta_{s_i^*}$, $\mathrm{add}_i$, $\mathrm{mult}_i$, and $\tilde{V}^*_{i+1}$ at a constant number of points. The $\tilde{V}^*_{i+1}$
evaluations are provided by P in all iterations i of the protocol except the last, while the $\beta_{s_i^*}$ evaluation can
be done in $O(\log(B \cdot S))$ time.

The $\mathrm{add}_i$ and $\mathrm{mult}_i$ computations can be done in pre-processing in time $O(S_i)$ by enumerating the
in-neighbors of each of the $S_i$ gates at layer i [14, 40]. Adding up the pre-processing time across all iterations i
of our protocol, V's pre-processing time is $O(\sum_i S_i) = O(S)$ as claimed.

In the final iteration of the protocol, P no longer provides the $\tilde{V}^*_{i+1}$ evaluation for V; instead, V must
evaluate the multilinear extension of the input at a point on her own. This can be done in a streaming manner
using space $O(\log n^*)$ in time $O(n^* \log n^*)$. The time cost for V in the online phase follows.

Costs for P. It remains to show that P can perform the required computations in iteration i of the protocol
in time $O((S_i + S_{i+1}) \cdot B \cdot \log S)$. To this end, notice $g^{(i)}$ is a polynomial in $v := s_i + 2s_{i+1} + b$ variables.
We order the sum in this sum-check protocol so that the $s_i + 2s_{i+1}$ variables in $p_1$, $\omega_1$, and $\gamma_1$ are bound
first in arbitrary order, followed by the variables of $p_2$. P can compute the prescribed messages in the first
$s_i + 2s_{i+1} = O(\log S)$ rounds exactly as in the implementation of Cormode et al. [14]. They show that each
gate at layers i and i + 1 of C* contributes to exactly one term in the sum defining P's message in any
given round of the sum-check protocol, and moreover the contribution of a given gate can be determined in
$O(1)$ time. Hence the total time required by P to handle these rounds is $O(B \cdot (S_i + S_{i+1}) \cdot \log S)$.
It remains to show how P can compute the prescribed messages in the final b rounds of the sum-check
protocol while investing $O((S_i + S_{i+1}) \cdot B)$ time across all rounds of the protocol.

Recall that in order to compute P's message in round j of the sum-check protocol applied to $g^{(i)}$, it
suffices for P to evaluate $g^{(i)}$ at $2^{v-j}$ points of the form $(r_1, \dots, r_{j-1}, t, b_{j+1}, \dots, b_v)$, with $t \in \{0, \dots, \deg_j(g^{(i)})\}$
and $(b_{j+1}, \dots, b_v) \in \{0,1\}^{v-j}$. Each of these evaluations of $g^{(i)}$ can be computed in $O(1)$ time given the
evaluations of $\beta_{s_i^*}$, $\mathrm{add}_i$, $\mathrm{mult}_i$, and $\tilde{V}^*_{i+1}$ at the relevant points.

Notice that once the variables in $p_1$, $\omega_1$, and $\gamma_1$ are bound to specific values, say $r_1^{(p)}$, $r_1^{(\omega)}$, and $r_1^{(\gamma)}$,
$\mathrm{add}_i(p_1, \omega_1, \gamma_1)$ and $\mathrm{mult}_i(p_1, \omega_1, \gamma_1)$ are themselves bound to specific values, namely $\mathrm{add}_i(r_1^{(p)}, r_1^{(\omega)}, r_1^{(\gamma)})$
and $\mathrm{mult}_i(r_1^{(p)}, r_1^{(\omega)}, r_1^{(\gamma)})$. So P only needs to evaluate these polynomials once, and both of these evaluations
can be computed by P in $O(S_i)$ time. Thus, the $\mathrm{add}_i$ and $\mathrm{mult}_i$ evaluations in the last b rounds require just $O(S_i)$
time in total.

P can evaluate the function $\beta_{s_i^*}$ at the relevant points exactly as in the proof of Theorem 1, using the $C^{(j)}$
arrays to ensure that this computation is done quickly. The array $C^{(0)}$ has size $2^{s_i^*} = O(S_i \cdot B)$, and $C^{(j-1)}$
gets updated to $C^{(j)}$ whenever a variable in $p_1$ or $p_2$ becomes bound. This ensures that across all rounds of
the sum-check protocol, the $\beta_{s_i^*}$ evaluations require $O(S_i \cdot B)$ time in total.

Likewise, the $\tilde{V}^*_{i+1}$ evaluations can be handled exactly as in Theorem 1, using the $V^{(j)}$ arrays to ensure
that this computation is done quickly. The array $V^{(0)}$ has size $2^{s^*_{i+1}} = O(S_{i+1} \cdot B)$, and $V^{(j-1)}$ gets updated
to $V^{(j)}$ whenever a variable in $\omega_1$ becomes bound (and similarly for the variables in $\gamma_1$). This ensures that
across all rounds of the sum-check protocol, the $\tilde{V}^*_{i+1}$ evaluations take $O((S_i + S_{i+1}) \cdot B)$ time in total.

Reducing to Verification of a Single Point. After executing the sum-check protocol at layer i as
described above, V is left with a claim about $\tilde{V}^*_{i+1}(\omega_1, p_2)$ and $\tilde{V}^*_{i+1}(\gamma_1, p_2)$, for $\omega_1, \gamma_1 \in \mathbb{F}^{s_{i+1}}$ and $p_2 \in \mathbb{F}^b$.
This requires P to send $\tilde{V}^*_{i+1}(\ell(t))$ for a canonical line $\ell(t)$ that passes through $(\omega_1, p_2)$ and $(\gamma_1, p_2)$. It is
easily seen that $\tilde{V}^*_{i+1}(\ell(t))$ is a univariate polynomial of degree at most $s_{i+1}$. Here, we are exploiting the fact
that the final b coordinates of $(\omega_1, p_2)$ and $(\gamma_1, p_2)$ are equal.

Hence P can specify $\tilde{V}^*_{i+1}(\ell(t))$ by sending $\tilde{V}^*_{i+1}(\ell(t_j))$ for $O(s_{i+1})$ many points $t_j \in \mathbb{F}$. Using the method
of Lemma 3, P can evaluate $\tilde{V}^*_{i+1}$ at each point $\ell(t_j)$ in $O(S^*_{i+1}) = O(S_{i+1} \cdot B)$ time, and hence can perform all $\tilde{V}^*_{i+1}(\ell(t_j))$
evaluations in $O(S_{i+1} \cdot B \cdot s_{i+1}) = O(S_{i+1} \cdot B \cdot \log S)$ time in total. This ensures that across all iterations of our protocol,
P devotes at most $O(S \cdot B \cdot \log S)$ time to the "reducing to verification of a single point" phase of the protocol.
This completes the proof. 
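
The degree bound used in the last step of the proof is easy to test numerically. In the following sketch (illustrative only; the function names, the table layout, and the choice of interpolation nodes 0, 1, 2, ... are assumptions), two points agree in their final b coordinates and differ in the first s. The restriction of a multilinear polynomial to the line through them then has degree at most s, so s + 1 evaluations determine it, and the interpolated polynomial matches a direct evaluation at a random parameter.

```python
import random

P = 2**61 - 1  # field size used in the experiments above

def mle_eval(table, point):
    """Evaluate the multilinear extension of `table` (length 2^m) at `point`."""
    for r in reversed(point):
        half = len(table) // 2
        table = [(table[k] + r * (table[k + half] - table[k])) % P for k in range(half)]
    return table[0]

def line(u, v, t):
    """Point at parameter t on the line with l(0) = u and l(1) = v."""
    return [(a + t * (c - a)) % P for a, c in zip(u, v)]

def lagrange_eval(ys, t):
    """Evaluate at t the polynomial of degree < len(ys) taking values ys at 0, 1, 2, ..."""
    total = 0
    for j, yj in enumerate(ys):
        num, den = 1, 1
        for m in range(len(ys)):
            if m != j:
                num = num * (t - m) % P
                den = den * (j - m) % P
        # Division in the field via Fermat's little theorem.
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total
```

Without the shared coordinates, the restriction could have degree as large as s + b, so the prover would have to send b more evaluations per layer.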



In practice we would expect the results of the data parallel phase of computation represented by the
super-circuit C* to be aggregated in some fashion. We assume this aggregation step is amenable to
verification via other techniques. In the case of counting queries, the aggregation step simply sums the outputs
of the data parallel step, which can be handled via Theorem 1, or slightly more efficiently via Proposition
7 described below in Section 8. More generally, if this aggregation step is computed by a circuit C' of size
$O(S \cdot B \cdot \log S / \log B)$ such that V can efficiently evaluate the multilinear extension of the wiring predicate of
C', then we can simply apply the basic GKR protocol to C' with asymptotic costs smaller than those of the
protocol described in Theorem 2. This application of the GKR protocol to C' ends with a claim about the
value of $\tilde{V}_1^*(z)$ for some $z \in \mathbb{F}^{s_1^*}$. The verifier can then invoke the protocol of Theorem 2 to verify this claim.

We stress that the protocol of Theorem 2 can be applied if there are multiple data parallel stages
interleaved with aggregation stages.

8 Extensions 

In this section we describe two final optimizations that are much more specialized than Theorems 1 and 2,
but have a significant effect in practice when they apply. In particular, Section 8.2 culminates in a protocol
for matrix multiplication that is of interest in its own right. It is hundreds of times faster than the protocol
implied by Theorem 1 and studied experimentally in Section 6.

8.1 Binary Tree of Addition Gates 

Cormode et al. [21] describe an optimization that applies to any circuit C with a single output that culminates
in a binary tree of addition gates; at a high level, they directly apply a single sum-check protocol to the entire 
binary tree, thereby treating the entire tree as a single addition gate with very large fan-in. In contrast, the 
optimization described here applies to circuits with multiple outputs, and allows the binary tree of addition 
gates to occur anywhere in the circuit, not just at the layers immediately preceding the output. 

At first blush, our optimization might seem quite specialized since it only applies to circuits with a 
specific wiring pattern. However, this is one of the most commonly occurring wiring patterns, as evidenced 
by its appearance within the circuits computing MATMULT, DISTINCT, Pattern Matching, and counting 
queries. Notice that our optimization also applies to verifying multiple independent instances of any problem 
with a single output whose circuit ends with a binary tree of sum-gates, such as verifying the number of 
distinct items in multiple distinct data streams, or posing multiple separate counting queries to a database. 
This is because, similar to Theorem 2, one can lay the circuits for each of the individual problem instances
side-by-side and treat the result as a single "super-circuit" culminating in a binary tree of addition gates with 
multiple outputs. 



The starting point for our optimization is the observation of Vu et al. [40] mentioned in Section 4.3.2:
in order to verify that P has correctly evaluated a circuit with many output gates, P may simply send V
the (claimed) values of all output gates, thereby specifying a function $V'_1 : \{0,1\}^{s_1} \to \mathbb{F}$ claimed to equal
$V_1$. V can pick a random point $z \in \mathbb{F}^{s_1}$ and evaluate $\tilde{V}'_1(z)$ on her own in $O(S_1)$ time. An application of the
Schwartz-Zippel Lemma (Lemma 1) implies that it is safe for V to believe that $V'_1$ is as claimed as long as
$\tilde{V}_1(z) = \tilde{V}'_1(z)$. Our protocol as described in Section 5 would then proceed in iterations, with one iteration
per layer of the circuit and one application of the sum-check protocol per iteration. This would ultimately
reduce P's claim about the value of $\tilde{V}_1(z)$ to a claim about $\tilde{V}_d(z')$ for some $z' \in \mathbb{F}^{s_d}$, where d is the input
layer of the circuit.

Instead, our final refinement uses a single sum-check protocol to directly reduce P's claim about $\tilde{V}_1(z)$
to a claim about $\tilde{V}_d(z')$ for some random point $z' \in \mathbb{F}^{s_d}$.



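Proposition 7 Let C be a depth-d circuit consisting of a binary tree of addition gates, $2^k$ inputs, and $2^{k-d}$
outputs. For any point $z \in \mathbb{F}^{k-d}$, $\tilde{V}_1(z) = \sum_{(p_{k-d+1}, \dots, p_k) \in \{0,1\}^d} g(p_{k-d+1}, \dots, p_k)$, where

\[ g(p_{k-d+1}, \dots, p_k) = \tilde{V}_d(z, p_{k-d+1}, \dots, p_k). \]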

Proof: At layer i of C, the gate with label $p \in \{0,1\}^{s_i}$ is the sum of the gates with labels $(p, 0)$ and $(p, 1)$ at
layer i + 1. It is then straightforward to observe that for any $p \in \{0,1\}^{k-d}$, the pth output gate has value

\[ V_1(p_1, \dots, p_{k-d}) = \sum_{(p_{k-d+1}, \dots, p_k) \in \{0,1\}^d} \tilde{V}_d(p_1, \dots, p_{k-d}, p_{k-d+1}, \dots, p_k). \qquad (9) \]

Notice that the right hand side of Equation (9) is a multilinear polynomial in the variables $(p_1, \dots, p_{k-d})$
that agrees with $V_1(p_1, \dots, p_{k-d})$ at all Boolean inputs. Hence, the right hand side is the (unique) multilinear
extension $\tilde{V}_1$ of the function $V_1 : \{0,1\}^{k-d} \to \mathbb{F}$. The proposition follows. ■

| Implementation | Problem Size | P Time  | V Time | Rounds | Total Communication | Circuit Eval Time |
|----------------|--------------|---------|--------|--------|---------------------|-------------------|
| Proposition 7  | 256 x 256    | 2.52 s  | [...]  | [...]  | [...]               | 0.73 s            |
| Theorem 1      | 512 x 512    | 37.85 s | 0.10 s | 236    | 5.48 KBs            | 6.07 s            |
| Proposition 7  | 512 x 512    | 22.98 s | 0.10 s | 39     | 0.86 KBs            | 6.07 s            |

Table 4: Experimental results for n x n MATMULT, with and without the refinement of Section 8.1. As
in Table 1, the Total Communication column does not count the $n^2$ field elements required to specify the
answer.



In applying the sum-check protocol to the polynomial g in Proposition 7, it is straightforward to use the
methods of Section 5.4.2 to implement the honest prover in time $O(2^k)$. We omit the details for brevity.
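
The identity established in the proof of Proposition 7 can be sanity-checked on a small tree. The sketch below is an illustration under assumptions: the function names are invented, and gate labels are mapped to list indices low bit first, so that the $2^d$ inputs feeding output o sit at indices $o + q \cdot 2^{k-d}$. It computes the outputs of the addition tree directly, evaluates their multilinear extension at a random z, and compares the result with the sum of the input-layer extension at $(z, q)$ over all Boolean q.

```python
import itertools
import random

P = 2**61 - 1  # field size used in the experiments above

def mle_eval(table, point):
    """Evaluate the multilinear extension of `table` (length 2^m) at `point`."""
    for r in reversed(point):
        half = len(table) // 2
        table = [(table[k] + r * (table[k + half] - table[k])) % P for k in range(half)]
    return table[0]

def tree_outputs(inputs, k, d):
    """Outputs of d levels of pairwise addition applied to 2^k inputs."""
    n_out = 2 ** (k - d)
    return [sum(inputs[o + q * n_out] for q in range(2 ** d)) % P for o in range(n_out)]

def sum_over_boolean_suffixes(inputs, z, d):
    """Sum of the input-layer extension at (z, q) over all q in {0,1}^d."""
    total = 0
    for q in itertools.product((0, 1), repeat=d):
        total = (total + mle_eval(inputs, list(z) + list(q))) % P
    return total
```

A single sum-check over the d suffix variables therefore replaces the d layer-by-layer iterations, which is consistent with the drop in round count reported in Table 4.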



Experimental Results. Let C be the circuit for naive matrix multiplication described in Section 5.5.1. To
demonstrate the efficiency gains implied by Proposition 7, we modified our MATMULT implementation of
Section 6.2.1 to use the protocol of Proposition 7 to verify the sub-circuit of C consisting of a binary tree of
addition gates. The results are shown in Table 4. Our optimizations in this section shave P's runtime by a
factor of 1.5x-2x, the total number of rounds by a factor of more than 5, and the total communication (not
counting the cost of specifying the output of the circuit) by a factor of more than 5.



8.2 Optimal Space and Time Costs for MATMULT 

We describe a final optimization here on top of Proposition 7. While this optimization is specific to the
MATMULT problem, its effects are substantial and the underlying observation may be more broadly applicable.

Suppose we are given an unverifiable algorithm for n x n matrix multiplication that requires time $T(n)$
and space $s(n)$. Our refinements reduce the prover's runtime from $O(n^3)$ in the case of Sections 5 and 8.1 to
$T(n) + O(n^2)$, and lower P's space requirement to $s(n) + o(n^2)$. That is, in the protocol the prover sends
the correct output and performs just $O(n^2)$ more work to provide a guarantee of correctness on top. It is
irrelevant what algorithm the prover uses to arrive at the correct output; in particular, algorithms much
more sophisticated than naive matrix multiplication are permitted. This runtime and space usage for P are
optimal even up to the leading constant, assuming matrix multiplication cannot be computed in $O(n^2)$ time.

The final protocol is extremely natural, as it consists of a single invocation of the sum-check protocol.
We believe this protocol is of interest in its own right. The proof and technical details are in Section 8.2.2.



Theorem 3 There is a valid interactive proof protocol for n x n matrix multiplication over the field $\mathbb{F}_q$ with
the following costs. The communication cost is $n^2 + O(\log n)$ field elements. The runtime of the prover is
$T(n) + O(n^2)$ and the space usage is $s(n) + o(n^2)$, where $T(n)$ and $s(n)$ are the time and space requirements
of any (unverifiable) algorithm for n x n matrix multiplication. The verifier can make a single streaming
pass over the input as well as over the claimed output in time $O(n^2 \log n)$, storing $O(\log n)$ field elements.

Using the observation of Vu et al. described in Lemma 3, the runtime of the verifier can be brought down
to $O(n^2)$ at the cost of increasing V's space usage to $O(n^2)$. Furthermore, by Remark 1, the runtime of the
verifier can be brought down to $O(n^2)$ while maintaining the streaming property if the input matrices are
presented in row-major order.

The prover's runtime in Theorem 3 is within an additive low-order term of any unverifiable algorithm
for matrix multiplication; this is essential in many practical scenarios where even a 2x slowdown is too steep
a price to pay for verifiability. Notice also that the space usage bounds in Theorem 3 are in stark contrast
to protocols based on circuit-checking: the prover in a general circuit-checking protocol may have to store
the entire circuit, and this can result in space requirements that are much larger than those of an unverifiable
algorithm for the problem. For example, naive matrix multiplication requires time $O(n^3)$, but only $O(n^2)$
space, while the provers in our MATMULT protocols of Sections 5 and 8.1 require both space and time $O(n^3)$.
As implementations of interactive proofs become faster, the prover is likely to run out of space long before
she runs out of time.

8.2.1 Comparison to Prior Work 

It is worth comparing Theorem 3 to a well-known protocol due to Freivalds [17]. Let D* denote the claimed
output matrix. In Freivalds' algorithm, the verifier stores a random vector $x \in \mathbb{F}^n$, and computes $D^*x$ and
$ABx$, accepting if and only if $ABx = D^*x$. Freivalds showed that this is a valid protocol. In both Freivalds'
protocol and that of Theorem 3, the prover runs in time $T(n) + O(n^2)$ (in the case of Freivalds' algorithm,
the $O(n^2)$ term is 0), and the verifier runs in linear or quasilinear time.
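
For readers unfamiliar with Freivalds' check, here is a minimal Python sketch over the prime field of size $2^{61} - 1$. The function names and the dense list-of-lists matrix representation are choices made for this illustration. The verifier never multiplies two matrices: it computes $B x$, then $A (B x)$, then $D^* x$, which is three matrix-vector products and hence time linear in the size of the matrices.

```python
import random

P = 2**61 - 1  # field size used in the experiments above

def mat_vec(M, x):
    """Matrix-vector product over the field."""
    return [sum(m * xi for m, xi in zip(row, x)) % P for row in M]

def freivalds_accepts(A, B, D, rng=random):
    """Accept iff A(Bx) = Dx for a random vector x."""
    n = len(A)
    x = [rng.randrange(P) for _ in range(n)]
    return mat_vec(A, mat_vec(B, x)) == mat_vec(D, x)
```

If $D^* \neq AB$, the check passes only when x lands in the kernel of the non-zero matrix $AB - D^*$, which happens with probability at most 1/P over the choice of x.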

We now highlight several properties of our protocol that are not achieved by prior work. 

Utility as a Primitive. A major advantage of Theorem 3 relative to prior work is its utility as a primitive
that can be used to verify more complicated computations. This is important as many algorithms repeatedly
invoke matrix multiplication as a subroutine. For concreteness, consider the problem of computing $A^{2^k}$ via
repeated squaring. By iterating the protocol of Theorem 3 k times, we obtain a valid interactive proof
protocol for computing $A^{2^k}$ with communication cost $n^2 + O(k \log(n))$. The $n^2$ term is due simply to specifying
the output $A^{2^k}$, and can often be avoided in applications; see for example the diameter protocol described
two paragraphs hence. The ith iteration of the protocol for computing $A^{2^k}$ reduces a claim about an
evaluation of the multilinear extension of $A^{2^{k-i+1}}$ to an analogous claim about $A^{2^{k-i}}$. Crucially, the prover in
this protocol never needs to send the verifier the intermediate matrices $A^{2^{k'}}$ for $k' < k$. In contrast, applying
Freivalds' algorithm to this problem would require $O(kn^2)$ communication, as P must specify each of the
intermediate matrices $A^{2^{k'}}$.

The ability to avoid having P explicitly send intermediate matrices is especially important in settings
where an algorithm repeatedly invokes matrix multiplication, but the desired output of the algorithm is
smaller than the size of the matrix. In these cases, it is not necessary for P to send any matrices; P can
instead send just the desired output, and V can use Theorem [3] to check the validity of the output with only
a polylogarithmic amount of additional communication. This is analogous to how the verifier in the GKR
protocol can check the values of the output gates of a circuit without ever seeing the values of the "interior"
gates of the circuit.

As a concrete example illustrating the power of our matrix multiplication protocol, consider the fundamental
problem of computing the diameter of an unweighted (possibly directed) graph $G$ on $n$ vertices. Let
$A$ denote the adjacency matrix of $G$, and let $I$ denote the $n \times n$ identity matrix. Then it is easily
verified that the diameter of $G$ is the least positive number $d$ such that $(A+I)^d_{ij} \neq 0$ for all $(i,j)$.
We therefore obtain the following natural protocol for diameter. P sends the claimed output $d$ to V, as well as
an $(i,j)$ such that $(A+I)^{d-1}_{ij} = 0$. To confirm that $d$ is the diameter of $G$, it suffices for V to check
two things: first, that all entries of $(A+I)^d$ are non-zero, and second, that $(A+I)^{d-1}_{ij}$ is indeed zero.

The first task is accomplished by combining our matrix multiplication protocol of Theorem [3] with
our DISTINCT protocol from Theorem [1]. Indeed, let $d_j$ denote the $j$th bit in the binary representation of
$d$. Then $(A+I)^d = \prod_j (A+I)^{d_j 2^j}$, so computing the number of non-zero entries of $(A+I)^d$ can be
treated as a sequence of $O(\log d)$ matrix multiplications, followed by a DISTINCT computation. The second
task, of verifying that $(A+I)^{d-1}_{ij} = 0$, is similarly accomplished using $O(\log d)$ invocations of the
matrix multiplication protocol of Theorem [3]; since V is only interested in one entry of $(A+I)^{d-1}$, P need
not send the matrix $(A+I)^{d-1}$ in full, and the total communication here is just $\mathrm{polylog}(n)$.
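The characterization of the diameter used above can be checked directly on small graphs. The sketch below is not the verification protocol itself; it is an unverified reference computation, assuming Boolean arithmetic is used to track the non-zero pattern of $(A+I)^d$ (valid over the integers, since all entries are non-negative). The function names are hypothetical.

```python
def bool_mat_mul(X, Y):
    """Boolean matrix product: entry (i, j) is 1 iff some k has X[i][k] = Y[k][j] = 1."""
    n = len(X)
    return [[int(any(X[i][k] and Y[k][j] for k in range(n))) for j in range(n)]
            for i in range(n)]

def diameter_with_witness(adj):
    """Return (d, (i, j)) where d is the least positive integer with (A+I)^d all
    non-zero and (i, j) is a zero entry of (A+I)^(d-1); return None if no such d exists."""
    n = len(adj)
    M = [[int(i == j or bool(adj[i][j])) for j in range(n)] for i in range(n)]  # pattern of A + I
    prev = [[int(i == j) for j in range(n)] for i in range(n)]                  # (A + I)^0
    for d in range(1, n + 1):
        cur = bool_mat_mul(prev, M)
        if all(all(row) for row in cur):
            witness = next(((i, j) for i in range(n) for j in range(n)
                            if not prev[i][j]), None)
            return d, witness
        prev = cur
    return None  # the graph is not strongly connected

# Directed 4-cycle 0 -> 1 -> 2 -> 3 -> 0: the farthest pair is at distance 3.
cycle = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [1, 0, 0, 0]]
```

The pair returned alongside $d$ plays the role of the index $(i,j)$ that the prover sends in the protocol.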

V's runtime in this diameter protocol is $O(m \log n)$, where $m$ is the number of edges in $G$. P's runtime
in the above diameter protocol matches the best known unverifiable diameter algorithm up to a low-order
additive term [33, 42], and the communication is just $\mathrm{polylog}(n)$. We know of no other protocol
achieving this.

As discussed above, the fact that P's slowdown is a low-order additive term is critical in the many
settings in which even a 2x slowdown to achieve verifiability is unacceptable. Moreover, for a graph with
$n$ = 1 million nodes, the total communication cost of the above protocol is on the order of KBs; in contrast,
if P had to send the matrices $(I+A)^d$ or $(I+A)^{d-1}$ explicitly (as required in prior work, e.g. Cormode et
al. [13]), the communication cost would be at least $n^2 = 10^{12}$ words, which translates to terabytes of data.
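The terabyte figure is simple arithmetic, sketched below under the assumption (not stated in the text) that each matrix entry occupies one 8-byte word, which suffices for a field element modulo $2^{61}-1$. The helper name is hypothetical.

```python
def explicit_matrix_bytes(n, bytes_per_word=8):
    """Bytes needed to send one n x n matrix, one word per entry.
    bytes_per_word=8 is an assumption: one 64-bit word per field element."""
    return n * n * bytes_per_word

n = 10**6
print(explicit_matrix_bytes(n) / 10**12, "TB")  # 8.0 TB
```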

Small-Space Streaming Verifiers. In Freivalds' algorithm, V has to store the random vector $x$, which
requires $\Omega(n)$ space. There are methods to reduce V's space usage by generating $x$ with limited randomness:
Kimbrel and Sinha [26] show how to reduce V's space to $O(\log n)$, but their solution does not work if V must
make a streaming pass over arbitrarily ordered input. Chakrabarti et al. [12] extend the method of Kimbrel
and Sinha to work with a streaming verifier, but this requires P to play back the input matrices $A, B$ in a
special order, increasing proof length to $3n^2$. Our protocol works with a streaming verifier using $O(\log n)$
space, and our proof length is $n^2 + O(\log n)$, where the $n^2$ term is due to specifying $AB$ and can be avoided
in applications such as the diameter example considered above.

8.2.2 Protocol Details 

The idea behind the optimization is as follows. All of our earlier circuit-checking protocols only make use
of the multilinear extension $\tilde V_i$ of the function $V_i$ mapping gate labels at layer $i$ of the circuit to
their values. In some cases, there is something to be gained by using a higher-degree extension of $V_i$, and this
is precisely what we exploit here. By using a higher-degree extension of the gate values in the circuit, we are
able to apply the sum-check protocol to a polynomial that differs from the one used in Section [5]. In particular,
the polynomial we use here avoids referencing the $\beta_{s_i}$ polynomial used in Section [5]. Details follow.

When multiplying matrices $A$ and $B$ such that $AB = D$, let $A(i,j)$, $B(i,j)$ and $D(i,j)$ denote functions
from $\{0,1\}^{\log n} \times \{0,1\}^{\log n} \to \mathbb{F}_q$ that map input $(i,j)$ to $A_{ij}$, $B_{ij}$, and
$D_{ij}$ respectively. Let $\tilde A$, $\tilde B$, and $\tilde D$ denote their multilinear extensions.

Lemma 5 For all $(p_1, p_2) \in \mathbb{F}^{\log n} \times \mathbb{F}^{\log n}$,

$$\tilde D(p_1, p_2) = \sum_{p_3 \in \{0,1\}^{\log n}} \tilde A(p_1, p_3) \cdot \tilde B(p_3, p_2).$$

Proof: For all $(p_1, p_2) \in \{0,1\}^{\log n} \times \{0,1\}^{\log n}$, the right hand side is easily seen to
equal $D(p_1, p_2)$, using the fact that $D_{ij} = \sum_k A_{ik} B_{kj}$ and the fact that $\tilde A$ and $\tilde B$
agree with the functions $A(i,j)$ and $B(i,j)$ at all Boolean inputs. Moreover, the right hand side is a
multilinear polynomial in the variables of $(p_1, p_2)$. Putting these facts together implies that the right
hand side is the unique multilinear extension of the function $D(i,j)$. ■
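Lemma 5 is easy to sanity-check numerically. The following sketch evaluates both sides of the identity at a random (non-Boolean) point for small random matrices. It uses brute-force evaluation of the multilinear extensions, so it is only meant for tiny $n$; the helper names and the most-significant-bit-first index convention are assumptions made for this illustration.

```python
import itertools
import random

Q = 2**61 - 1  # prime field modulus (assumption: same as the paper's experiments)

def chi(bits, point, q=Q):
    """Multilinear Lagrange basis polynomial for the Boolean vector `bits`, at `point`."""
    w = 1
    for b, r in zip(bits, point):
        w = w * (r if b else 1 - r) % q
    return w

def to_int(bits):
    """Interpret a bit tuple as an integer, most significant bit first."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value

def mle(M, r_row, r_col, q=Q):
    """Brute-force evaluation of the multilinear extension of matrix M at (r_row, r_col)."""
    cube = list(itertools.product((0, 1), repeat=len(r_row)))
    return sum(M[to_int(i)][to_int(j)] * chi(i, r_row, q) * chi(j, r_col, q)
               for i in cube for j in cube) % q

def lemma5_sides(A, B, r1, r2, q=Q):
    """Return (left side, right side) of Lemma 5 at the point (r1, r2)."""
    n = len(A)
    D = [[sum(A[i][k] * B[k][j] for k in range(n)) % q for j in range(n)]
         for i in range(n)]
    lhs = mle(D, r1, r2, q)
    rhs = sum(mle(A, r1, p3, q) * mle(B, p3, r2, q)
              for p3 in itertools.product((0, 1), repeat=len(r1))) % q
    return lhs, rhs
```

Calling `lemma5_sides` on random $4 \times 4$ matrices with random `r1`, `r2` of length 2 should return two equal field elements.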

Lemma [5] implies the following valid interactive proof protocol for matrix multiplication: P sends a
matrix $D^*$ claimed to equal the product $D = AB$. V evaluates $\tilde D^*(r_1, r_2)$ at a random point
$(r_1, r_2) \in \mathbb{F}^{\log n} \times \mathbb{F}^{\log n}$. By the Schwartz-Zippel lemma, it is safe for V to
believe $D^*$ is as claimed, as long as $\tilde D^*(r_1, r_2) = \tilde D(r_1, r_2)$ (formally, if $D^* \neq D$, then
$\tilde D^*(r_1, r_2) \neq \tilde D(r_1, r_2)$ with probability at least $1 - 2\log n / q$). In order to check
that $\tilde D^*(r_1, r_2) = \tilde D(r_1, r_2)$, we invoke a sum-check protocol on the polynomial
$g(p_3) = \tilde A(r_1, p_3) \cdot \tilde B(p_3, r_2)$. V's final check in this protocol requires her to compute
$g(r_3)$ for a random point $r_3 \in \mathbb{F}^{\log n}$. V can do this by evaluating both of $\tilde A(r_1, r_3)$
and $\tilde B(r_3, r_2)$ with a single streaming pass over the input, and then multiplying the results.
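The single streaming pass mentioned above can be illustrated as follows. This is a minimal sketch, assuming the input arrives as a stream of `(i, j, value)` triples in arbitrary order and that index bits are read most significant first; `stream_mle` is a hypothetical name. The verifier keeps only a running sum and the evaluation point, and spends $O(\log n)$ field operations per entry.

```python
import random

Q = 2**61 - 1  # prime field modulus (assumption)

def stream_mle(entries, r_row, r_col, q=Q):
    """Evaluate the multilinear extension of a matrix at (r_row, r_col), given its
    entries as (i, j, value) triples in any order. Only `acc` is kept between entries."""
    logn = len(r_row)
    acc = 0
    for i, j, value in entries:
        w = value % q
        for t in range(logn):
            shift = logn - 1 - t            # bit t of the index, most significant first
            ri, rj = r_row[t], r_col[t]
            w = w * (ri if (i >> shift) & 1 else 1 - ri) % q
            w = w * (rj if (j >> shift) & 1 else 1 - rj) % q
        acc = (acc + w) % q
    return acc
```

Because each entry contributes an independent term to the sum, the result does not depend on the order in which the stream presents the entries.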

The prover can be made to run in time $T(n) + O(n^2)$ across all rounds of the sum-check protocol using
the $V^{(j)}$ arrays described in Section [5] to quickly evaluate $\tilde A$ and $\tilde B$ at all of the necessary
points. The $V^{(j)}$ arrays are initialized in round 0 to equal the input matrices themselves, and there is no need
for P to maintain an "uncorrupted" copy of the original input (though in practice this is likely to be desirable).
Thus, the $V^{(j)}$ arrays can be computed using the storage P initially devoted to the inputs, and P needs to
store just $O(1)$ additional field elements over the course of the protocol (P does not even need to store the
messages sent by V, as P need not refer to the $j$th message once the array $V^{(j)}$ is computed). The claimed
$s(n) + o(n^2)$ space usage bound for P follows.
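To make the array-based prover concrete, here is a hedged sketch of the sum-check protocol applied to $g(p_3) = \tilde A(r_1, p_3) \cdot \tilde B(p_3, r_2)$, with prover and verifier simulated in one function. It simplifies the scheme described above: the tables are first reduced to the vectors $\tilde A(r_1, \cdot)$ and $\tilde B(\cdot, r_2)$ rather than folded in place from the input matrices, and the verifier's final evaluation is recomputed by folding instead of a streaming pass. All function names are hypothetical.

```python
import random

Q = 2**61 - 1  # prime field modulus (assumption)

def fold(table, r, q=Q):
    """Bind the leading variable of a multilinear table to r, halving its length."""
    half = len(table) // 2
    return [((1 - r) * table[s] + r * table[s + half]) % q for s in range(half)]

def bind_rows(M, r_bits, q=Q):
    """Return the vector k -> M~(r, k) by binding the row-index variables of M to r."""
    rows = [list(row) for row in M]
    for r in r_bits:
        half = len(rows) // 2
        rows = [[((1 - r) * x + r * y) % q for x, y in zip(rows[s], rows[s + half])]
                for s in range(half)]
    return rows[0]

def eval_quadratic(y0, y1, y2, r, q=Q):
    """Evaluate at r the degree-2 polynomial taking values y0, y1, y2 at 0, 1, 2."""
    inv2 = pow(2, q - 2, q)
    l0 = (r - 1) * (r - 2) * inv2
    l1 = -r * (r - 2)
    l2 = r * (r - 1) * inv2
    return (y0 * l0 + y1 * l1 + y2 * l2) % q

def sumcheck_accepts(A, B, r1, r2, claim, q=Q, rng=random):
    """Simulate an honest prover and the verifier for the claim
    sum_k A~(r1, k) * B~(k, r2) == claim. Returns the verifier's verdict."""
    a = bind_rows(A, r1, q)                            # a[k] = A~(r1, k)
    b = bind_rows([list(c) for c in zip(*B)], r2, q)   # b[k] = B~(k, r2)
    a_in, b_in = a, b          # stand-in for the verifier's own view of the input
    current, r3 = claim % q, []
    while len(a) > 1:
        half = len(a) // 2
        # Prover's message: the round polynomial evaluated at t = 0, 1, 2.
        msg = [sum(((1 - t) * a[s] + t * a[s + half]) * ((1 - t) * b[s] + t * b[s + half])
                   for s in range(half)) % q for t in (0, 1, 2)]
        if (msg[0] + msg[1]) % q != current:           # verifier's consistency check
            return False
        r = rng.randrange(q)                           # verifier's random challenge
        current = eval_quadratic(msg[0], msg[1], msg[2], r, q)
        a, b, r3 = fold(a, r, q), fold(b, r, q), r3 + [r]
    for r in r3:                                       # verifier's final evaluation g(r3)
        a_in, b_in = fold(a_in, r, q), fold(b_in, r, q)
    return current == a_in[0] * b_in[0] % q
```

Each call to `fold` halves the table it is applied to, so the prover's work across all rounds is linear in the length of the initial vectors; this mirrors the geometric decrease that the $V^{(j)}$ arrays provide.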

Remark 8 Let $C$ be the circuit for naive matrix multiplication described in Section [?]. Notice that the
$3\log n$-variate polynomial $h(p_1, p_2, p_3) = \tilde A(p_1, p_3) \cdot \tilde B(p_3, p_2)$ extends the function
$V_i$ mapping gate labels at layer $i = \log n$ of $C$ to their values. However, $h$ is not the multilinear
extension of $V_i$, as $h$ has degree two in the variables of $p_3$.

Informally, Theorem [3] cannot be said to perform "circuit checking" on $C$, since it is not necessary for P
to evaluate all of the gates in $C$; indeed, the prover in Theorem [3] can run in sub-cubic time using fast matrix
multiplication algorithms. However, the use of a low-degree extension of the gate values at layer $\log n$ of $C$
allows one to view the protocol of Theorem [3] as a direct extension of the circuit-checking methodology.

Remark 9 Consider the problem of computing a matrix power $M^{2^k}$ via repeated squaring. We may apply the
protocol of Theorem [3] in $k$ iterations, with the $i$th iteration applied to inputs $A = B = M^{2^{k-i}}$. The
$i$th iteration of this protocol reduces a claim about an evaluation of the multilinear extension of
$M^{2^{k-i+1}}$ to an analogous claim about the multilinear extension of $M^{2^{k-i}}$ at two points of the form
$(r_1, r_3), (r_3, r_2) \in \mathbb{F}^{\log n} \times \mathbb{F}^{\log n}$. We can further reduce the claims about
$(r_1, r_3)$, $(r_3, r_2)$ to a claim about a single point exactly as in the "Reducing to Verification of a Single
Point" step of the GKR protocol. We then move on to iteration $i+1$. Notice in particular that the verifier only
needs to observe the output matrix $M^{2^k}$ and the input matrix $M$ to run this protocol; in particular, P does
not need to explicitly send the intermediate matrices $M^{2^{k'}}$ to V.

We implemented the protocol just described (our implementation is sequential). The results are shown
in Table [5], where the column labelled "Additional Time for P" denotes the time required to compute P's
prescribed messages after P has already computed the correct answer. We report the naive matrix multiplication
time both when the computation is done using standard multiplication of 64-bit integers, as well
as when the computation is done using finite field arithmetic over the field with $q = 2^{61} - 1$ elements. The
reported verifier runtime corresponds to the $O(n^2 \log n)$ time bound stated in Theorem [3]. The verifier's
runtime could be improved using Lemma [3] at the cost of increasing V's space usage to $O(n)$, but we did not
implement this optimization. Moreover, if the input matrices are presented in row-major order, then the
observation of Vu et al. described in Remark [1] improves V's runtime with no increase in space usage.

The main takeaways from Table [5] are that the verifier does indeed save substantial time relative to 
performing matrix multiplication locally, and that the runtime of the prover is hugely dominated by the time 
required simply to compute the answer. 

Table 5: Experimental results for the $n \times n$ MATMULT protocol of Theorem [3].

9 Conclusion

We believe our results substantially advance the goal of achieving a truly practical general purpose
implementation of interactive proofs. The $O(\log S(n))$ factor overhead in the runtime of the prover within prior
implementations of the GKR protocol is too steep a price to pay in practice, and our refinements (formalized
in Theorem [1]) remove this logarithmic factor overhead for circuits with regular wiring patterns. Our
experiments demonstrate that these protocols yield a serial prover that is less than 10x slower than a C++
program that simply evaluates the circuit in serial, and that our protocols are highly amenable to
parallelization. Exploiting similar ideas, we have also extended the reach of prior interactive proof protocols by
describing an efficient protocol (formalized in Theorem [2]) for general data parallel computation, and given
a protocol for matrix multiplication in which the prover's overhead (relative to any unverifiable algorithm)
is just a low-order additive term. The latter is a powerful primitive for verifying the many algorithms that
repeatedly invoke matrix multiplication. A major message of our results is that the more structure that exists
in a computation, the more efficiently it can be verified both in theory and in practice, and that this structure
exists in many real-world computations.

We believe two directions in particular are worthy of future work. The first direction is to build a
full-fledged system implementing our protocol for data parallel computation. Our vision is to combine our
protocol with a high-level programming language allowing the programmer to easily specify data parallel
computations, analogous to frameworks such as MapReduce. Any such program could be automatically
compiled in the manner of Vu et al. [40] into a circuit, and our protocol could be run automatically on
that circuit. The second direction is to further enable such a compiler to automatically take advantage
of our other refinements, which are targeted at computations that are not necessarily data parallel. These
refinements apply to a circuit on a layer-by-layer basis, so they may yield substantial speedups in practice
even if they apply only to a subset of the layers of a circuit.

Acknowledgements. The author is grateful to Frank McSherry for raising the question of outsourcing 
general data parallel computations, and to Michael Mitzenmacher and Graham Cormode for discussions 
and feedback that greatly improved the quality of this manuscript. 
A Proof of Theorem [1]

Proof: Consider layer $i$ of the circuit $C$. Since $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$ are regular, there
is a subset of input bits $S_i \subseteq [s_i]$ with $|S_i| = c_i$ for some constant $c_i$, such that each input bit
in $[s_i] \setminus S_i$ affects $O(1)$ of the output bits of $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$. Number the
input variables so that the numbers $\{1, \dots, c_i\}$ correspond to variables in $S_i$.

Let $\rho \in \{0,1\}^{c_i}$ be an assignment to the variables in $S_i$, and let $I_\rho : \{0,1\}^{s_i} \to \{0,1\}$
denote the indicator function for $\rho$. For example, if $c_i = 3$ and $\rho = (1,0,1)$, then $I_\rho(x) = 1$ if
$x_1 = 1$, $x_2 = 0$, and $x_3 = 1$, and $I_\rho(x) = 0$ otherwise. Let $\tilde I_\rho$ denote the multilinear extension
of $I_\rho$. In the previous example, $\tilde I_\rho = x_1 (1 - x_2) x_3$. Finally, let $\mathrm{in}^{(i)}_{1,\rho}$ and
$\mathrm{in}^{(i)}_{2,\rho}$ denote the functions $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$ with the variables in
$S_i$ fixed to the assignment $\rho$, and for $k \in \{1,2\}$, let $b_{\rho,k,j}$ denote the $j$th output bit of
$\mathrm{in}^{(i)}_{k,\rho}$.
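The multilinear extension of the indicator function is a simple product, as the example $x_1(1-x_2)x_3$ suggests. A minimal sketch follows, assuming (as in the numbering convention above) that the variables of $S_i$ come first in the input vector; `indicator_mle` is a hypothetical name.

```python
Q = 2**61 - 1  # prime field modulus (assumption)

def indicator_mle(rho, x, q=Q):
    """Evaluate the multilinear extension of the indicator of assignment rho at x.
    Only the first len(rho) coordinates of x (the variables in S_i) are used."""
    out = 1
    for bit, xj in zip(rho, x):
        out = out * (xj if bit else 1 - xj) % q
    return out
```

On Boolean inputs this returns 1 exactly when the first coordinates of `x` equal `rho`, and on general field inputs it evaluates the product polynomial, e.g. $3 \cdot (1-5) \cdot 7$ at $x = (3, 5, 7)$ for $\rho = (1,0,1)$.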

By regularity, for each assignment $\rho \in \{0,1\}^{c_i}$ to the variables in $S_i$, the $j$th output bit
$b_{\rho,k,j}$ of $\mathrm{in}^{(i)}_{k,\rho}$ depends on only one variable $x_{q(\rho,k,j)}$ with
$q(\rho,k,j) \in [s_i] \setminus S_i$, for some function $q(\rho,k,j)$. Let
$\tilde b_{\rho,k,j}(x_{q(\rho,k,j)}) : \mathbb{F} \to \mathbb{F}$ denote the multilinear extension of the function
$b_{\rho,k,j}(x_{q(\rho,k,j)}) : \{0,1\} \to \{0,1\}$. If $b_{\rho,k,j}$ is not identically 0 or identically 1, then
either $\tilde b_{\rho,k,j}(x_{q(\rho,k,j)}) = x_{q(\rho,k,j)}$ or $\tilde b_{\rho,k,j}(x_{q(\rho,k,j)}) = 1 - x_{q(\rho,k,j)}$.

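For any $\rho \in \{0,1\}^{c_i}$, define $\tilde{\mathrm{in}}^{(i)}_{1,\rho}$ to be the concatenation of the
$\tilde b_{\rho,1,j}$ functions for all $j \in [s_{i+1}]$. Under this definition, $\tilde{\mathrm{in}}^{(i)}_{1,\rho}$
is a collection of $s_{i+1}$ linear polynomials, where each of the polynomials depends on a single variable, and we
may view $\tilde{\mathrm{in}}^{(i)}_{1,\rho}$ as a single function mapping $\mathbb{F}^{s_i}$ to $\mathbb{F}^{s_{i+1}}$.
We define $\tilde{\mathrm{in}}^{(i)}_{2,\rho}$ and $\tilde{\mathrm{type}}^{(i)}_{\rho}$ analogously to
$\tilde{\mathrm{in}}^{(i)}_{1,\rho}$.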
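Now let

$$W^{(i)}(p) = \sum_{\rho \in \{0,1\}^{c_i}} \tilde I_\rho(p) \cdot \Big( \tilde{\mathrm{type}}^{(i)}_\rho(p) \, \tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big) \cdot \tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{2,\rho}(p)\big) + \big(1 - \tilde{\mathrm{type}}^{(i)}_\rho(p)\big) \Big[ \tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big) + \tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{2,\rho}(p)\big) \Big] \Big).$$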

It is easily checked that for all $p \in \{0,1\}^{s_i}$, $V_i(p) = W^{(i)}(p)$. Lemma [4] then implies that
$\tilde V_i(z) = \sum_{p \in \{0,1\}^{s_i}} g^{(i)}(p)$, where $g^{(i)}(p) = \beta_{s_i}(z,p) \cdot W^{(i)}(p)$. Our
protocol follows precisely the description of Section 5.1, with P and V applying the sum-check protocol to the
polynomial $g^{(i)}$ at iteration $i$.

Communication Costs and Costs to V. Notice that our polynomial $g^{(i)}(p) = \beta_{s_i}(z,p) \cdot W^{(i)}(p)$ has
degree $O(1)$ in each variable. Indeed, $\beta_{s_i}(z,p)$ has degree 1 in each variable. Moreover, $W^{(i)}(p)$ is a
sum of polynomials that each have degree $O(1)$ in each variable, and hence $W^{(i)}(p)$ itself has degree $O(1)$ in
each variable.

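This latter fact can be seen by observing that for each assignment $\rho \in \{0,1\}^{c_i}$ to the variables in
$S_i$, it holds that $\tilde I_\rho(p)$, $\tilde{\mathrm{type}}^{(i)}_\rho(p)$,
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)$ and
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{2,\rho}(p)\big)$ all have constant degree in each variable. That
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)$ and
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{2,\rho}(p)\big)$ have constant degree in each variable follows from the
facts that $\tilde V_{i+1}$ is a multilinear polynomial, and that each input variable $j \in [s_i] \setminus S_i$
affects at most a constant number of outputs for $\mathrm{in}^{(i)}_{1,\rho}$ and $\mathrm{in}^{(i)}_{2,\rho}$ by
Property 1 of Definition [3].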

Since $g^{(i)}(p)$ has degree $O(1)$ in each variable, the claimed communication cost and the costs to the
verifier follow immediately by summing the corresponding costs of the sum-check protocols over all iterations
$i \in \{1, \dots, d(n)\}$ (see Section 4.2).



Time Cost for P. It remains to demonstrate how P can compute her prescribed messages when applying
the sum-check protocol to the polynomial $g^{(i)}$ in time $O(S_i + S_{i+1})$. It will follow that P's runtime over
all $d(n)$ invocations of the sum-check protocol is $O\big(\sum_{i=1}^{d(n)} S_i\big) = O(S(n))$.



As in our analysis of Section 5.4, it suffices to show how P can quickly evaluate $g^{(i)}$ at all points in
$S^{(j)}$, where $S^{(j)}$ consists of all points of the form $p = (r_1, \dots, r_{j-1}, t, p_{j+1}, \dots, p_{s_i})$
with $t \in \{0, 1, \dots, \deg_j(g^{(i)})\}$ and $(p_{j+1}, \dots, p_{s_i}) \in \{0,1\}^{s_i - j}$. As
$g^{(i)}(p) = \beta_{s_i}(z,p) \cdot W^{(i)}(p)$, it suffices for P to evaluate $\beta_{s_i}(z, \cdot)$ and
$W^{(i)}(\cdot)$ at all such points $p$. The $\beta_{s_i}(z, \cdot)$ computations can be done in $O(S_i)$ total time
across all iterations of the sum-check protocol, exactly as in Section 5.4.1.



To see how P can efficiently evaluate all of the $W^{(i)}(p)$ values, notice that for any fixed
point $p \in \mathbb{F}^{s_i}$, $W^{(i)}(p)$ can be computed efficiently given $\tilde{\mathrm{type}}^{(i)}_\rho(p)$,
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)$, and
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{2,\rho}(p)\big)$ for all $\rho \in \{0,1\}^{c_i}$. As
$|S_i| = c_i = O(1)$, modulo a constant-factor blowup in runtime it suffices to explain how to perform these
evaluations for a fixed restriction $\rho \in \{0,1\}^{c_i}$ to the variables in $S_i$.

It is easy to see that $\tilde{\mathrm{type}}^{(i)}_\rho(p)$ can be evaluated in constant time, since this function
depends on only 1 input variable $x_{q(\rho,3,1)}$. All that remains is to show how P can evaluate
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)$ quickly; the case for
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{2,\rho}(p)\big)$ is similar.

To this end, we follow the approach of Section [5.4.2].

Pre-processing. P will begin by computing an array $V^{(0)}$, which is simply defined to be the vector of gate
values at layer $i+1$, i.e. identifying a number $0 \le j < S_{i+1}$ with its binary representation in
$\{0,1\}^{s_{i+1}}$, P sets $V^{(0)}[(j_1, \dots, j_{s_{i+1}})] = V_{i+1}(j_1, \dots, j_{s_{i+1}})$ for each
$(j_1, \dots, j_{s_{i+1}}) \in \{0,1\}^{s_{i+1}}$. The right hand side of this equation is simply the value of the
$j$th gate at layer $i+1$ of $C$. So P can fill in the array $V^{(0)}$ when she evaluates the circuit $C$, before
receiving any messages from V.

Overview of Online Processing. Assume without loss of generality that the output bits of
$\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)$ are labelled in increasing order of the input bits they are affected by. So
for example if $p_1$ affects 2 output bits of $\mathrm{in}^{(i)}_{1,\rho}$ and $p_2$ affects 3 output bits, then the
bits affected by $p_1$ are labelled 1 and 2 respectively, while the bits affected by $p_2$ are labelled 3, 4, and 5.

In round $j$ of the sum-check protocol, P needs to evaluate the polynomial $\tilde V_{i+1}$ at the
$O(2^{s_i + 1 - j})$ points in the sets $\tilde{\mathrm{in}}^{(i)}_{1,\rho}(S^{(j)})$ and
$\tilde{\mathrm{in}}^{(i)}_{2,\rho}(S^{(j)})$. P will do this using the help of intermediate arrays as follows.

Efficiently Constructing $V^{(j)}$ Arrays. Let $a_{j-1}$ denote the total number of output bits affected by the
first $j-1$ input variables. Inductively, assume P has computed in the previous round an array $V^{(j-1)}$ of length
$2^{s_{i+1} - a_{j-1}}$, such that for each $(p_{a_{j-1}+1}, \dots, p_{s_{i+1}}) \in \{0,1\}^{s_{i+1} - a_{j-1}}$, the
corresponding entry of $V^{(j-1)}$ equals

$$V^{(j-1)}[(p_{a_{j-1}+1}, \dots, p_{s_{i+1}})] = \sum_{(c_1, \dots, c_{a_{j-1}}) \in \{0,1\}^{a_{j-1}}} V_{i+1}(c_1, \dots, c_{a_{j-1}}, p_{a_{j-1}+1}, \dots, p_{s_{i+1}}) \cdot \prod_{k=1}^{a_{j-1}} \chi_{c_k}\big(\tilde b_{\rho,1,k}(r_{q(\rho,1,k)})\big),$$

where recall that $q(\rho,1,k)$ is the input bit that output bit $k$ of $\mathrm{in}^{(i)}_{1,\rho}$ depends on. As
the base case, we explained how P can fill in $V^{(0)}$ in the process of evaluating the circuit $C$.

Let $x_1, \dots, x_{s_i}$ denote the input variables to $\tilde{\mathrm{in}}^{(i)}_{1,\rho}$, and let
$b_1, \dots, b_{s_{i+1}}$ denote the outputs of $\tilde{\mathrm{in}}^{(i)}_{1,\rho}$. Intuitively, at the end of round
$j$ of the sum-check protocol, P must "bind" input variable $x_j$ to value $r_j \in \mathbb{F}$. This has the effect
of binding the output variables affected by $x_j$, since each such output variable depends only on $x_j$. For
illustration, suppose the variable $x_1$ affects output variable $b_1$; specifically, suppose that $b_1 = 1 - x_1$.
Then binding $x_1$ to value $r_1$ has the effect of binding $b_1$ to value $1 - r_1$. $V^{(j)}$ is obtained from
$V^{(j-1)}$ by taking this into account. We formalize this as follows.

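Assume that variable $x_j$ affects only one output variable $b_{\rho,1,a_{j-1}+1}$, and thus $a_j = a_{j-1} + 1$;
if this is not the case, we can compute $V^{(j)}$ by applying the following update once for each output variable
affected by $x_j$. Observe that P can compute $V^{(j)}$ given $V^{(j-1)}$ in $O(2^{s_{i+1} - a_{j-1}})$ time using the
following recurrence, which is obtained by splitting the sum in the definition of $V^{(j)}$ according to the value
of $c_{a_j}$:

$$V^{(j)}[(p_{a_j+1}, \dots, p_{s_{i+1}})] = V^{(j-1)}[(0, p_{a_j+1}, \dots, p_{s_{i+1}})] \cdot \chi_0\big(\tilde b_{\rho,1,a_j}(r_j)\big) + V^{(j-1)}[(1, p_{a_j+1}, \dots, p_{s_{i+1}})] \cdot \chi_1\big(\tilde b_{\rho,1,a_j}(r_j)\big).$$

Thus, at the end of round $j$ of the sum-check protocol, when V sends P the value $r_j$, P can compute
$V^{(j)}$ from $V^{(j-1)}$ in $O(2^{s_{i+1} - a_{j-1}})$ time.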
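The update that binds one output bit can be sketched as a single pass that halves the array. This is an illustrative sketch under stated assumptions: the table is indexed with the bit being bound as the leading (most significant) index bit, and the affected output bit is either $x_j$ or $1 - x_j$ (selected by the hypothetical flag `negated`). It uses $\chi_0(b) = 1 - b$ and $\chi_1(b) = b$.

```python
Q = 2**61 - 1  # prime field modulus (assumption)

def bind_output_bit(table, r, negated=False, q=Q):
    """Bind the leading output bit of `table` after the verifier sends r.
    The output bit equals x_j (negated=False) or 1 - x_j (negated=True),
    so its bound value is b = r or b = 1 - r respectively."""
    b = (1 - r) % q if negated else r % q
    half = len(table) // 2
    # New entry = old entry with leading bit 0 times chi_0(b) = 1 - b,
    #           + old entry with leading bit 1 times chi_1(b) = b.
    return [((1 - b) * table[s] + b * table[half + s]) % q for s in range(half)]
```

For example, with gate values `[10, 20, 30, 40]`, binding the leading bit to $r = 1$ selects the second half of the table, while the same call with `negated=True` selects the first half. Each call takes time linear in the current table length, matching the stated update cost.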

Using the $V^{(j)}$ Arrays. We now show how to use the array $V^{(j-1)}$ to evaluate
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)$ in constant time for any point $p$ of the form
$p = (r_1, \dots, r_{j-1}, t, p_{j+1}, \dots, p_{s_i})$ with $(p_{j+1}, \dots, p_{s_i}) \in \{0,1\}^{s_i - j}$. In order
to ease notation in the following derivation, we make the simplifying assumption that
$\tilde b_{\rho,1,k}(x_{q(\rho,1,k)}) = x_{q(\rho,1,k)}$ for all output bits $k \in [s_{i+1}]$. The derivation when this
assumption does not hold is similar.
We exploit the following sequence of equalities:

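$$
\begin{aligned}
\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)
&= \sum_{c \in \{0,1\}^{s_{i+1}}} V_{i+1}(c) \, \chi_c\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big) \\
&= \sum_{(c_1, \dots, c_{a_{j-1}}) \in \{0,1\}^{a_{j-1}}} \; \sum_{(c_{a_{j-1}+1}, \dots, c_{s_{i+1}}) \in \{0,1\}^{s_{i+1} - a_{j-1}}} V_{i+1}(c) \, \chi_c\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big) \\
&= \sum_{(c_1, \dots, c_{a_{j-1}}) \in \{0,1\}^{a_{j-1}}} \; \sum_{(c_{a_{j-1}+1}, \dots, c_{s_{i+1}}) \in \{0,1\}^{s_{i+1} - a_{j-1}}} V_{i+1}(c) \Big( \prod_{k=1}^{a_{j-1}} \chi_{c_k}\big(\tilde b_{\rho,1,k}(r_{q(\rho,1,k)})\big) \Big) \Big( \prod_{k=a_{j-1}+1}^{a_j} \chi_{c_k}\big(\tilde b_{\rho,1,k}(t)\big) \Big) \Big( \prod_{k=a_j+1}^{s_{i+1}} \chi_{c_k}\big(p_{q(\rho,1,k)}\big) \Big) \\
&= \sum_{(c_1, \dots, c_{a_j}) \in \{0,1\}^{a_j}} V_{i+1}\big(c_1, \dots, c_{a_j}, p_{q(\rho,1,a_j+1)}, \dots, p_{q(\rho,1,s_{i+1})}\big) \Big( \prod_{k=1}^{a_{j-1}} \chi_{c_k}\big(r_{q(\rho,1,k)}\big) \Big) \Big( \prod_{k=a_{j-1}+1}^{a_j} \chi_{c_k}(t) \Big) \\
&= \sum_{(c_{a_{j-1}+1}, \dots, c_{a_j}) \in \{0,1\}^{a_j - a_{j-1}}} V^{(j-1)}\big[\big(c_{a_{j-1}+1}, \dots, c_{a_j}, p_{q(\rho,1,a_j+1)}, \dots, p_{q(\rho,1,s_{i+1})}\big)\big] \cdot \prod_{k=a_{j-1}+1}^{a_j} \chi_{c_k}(t).
\end{aligned}
$$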

Here, the first equality holds by Equation ([8]). The third holds by definition of the functions $\chi_c$ and
$\tilde{\mathrm{in}}^{(i)}_{1,\rho}$, as well as the assumption that
$\tilde b_{\rho,1,k}(x_{q(\rho,1,k)}) = x_{q(\rho,1,k)}$ for all $k \in [s_{i+1}]$. The fourth holds because for
Boolean values $c_k, p_{q(\rho,1,k)} \in \{0,1\}$, $\chi_{c_k}(p_{q(\rho,1,k)}) = 1$ if $c_k = p_{q(\rho,1,k)}$, and
$\chi_{c_k}(p_{q(\rho,1,k)}) = 0$ otherwise. The final equality holds by definition of the array $V^{(j-1)}$.

The final expression above can be computed in $O(2^{a_j - a_{j-1}})$ time given the array $V^{(j-1)}$. Since
$a_j - a_{j-1}$ is constant by Property 1 of Definition [3], $O(2^{a_j - a_{j-1}}) = O(1)$.

Putting Things Together. In round $j$ of the sum-check protocol, P uses the array $V^{(j-1)}$ to evaluate
$\tilde V_{i+1}\big(\tilde{\mathrm{in}}^{(i)}_{1,\rho}(p)\big)$ for all $O(2^{s_i - j})$ points $p \in S^{(j)}$, which
requires constant time per point and hence $O(2^{s_i - j})$ time over all points in $S^{(j)}$. At the end of round
$j$, V sends P the value $r_j$, and P computes $V^{(j)}$ from $V^{(j-1)}$ in $O(2^{s_{i+1} - a_{j-1}})$ time. By
ordering input variables in such a way that $a_j > a_{j-1}$ for all $j$, we ensure that in total across
all rounds of the sum-check protocol, P spends
$O\big(\sum_{j=1}^{s_i} 2^{s_i - j} + 2^{s_{i+1} - j}\big) = O(2^{s_i} + 2^{s_{i+1}})$ time to evaluate
$\tilde V_{i+1}$ at the relevant points. When combined with our $O(2^{s_i})$-time algorithm for computing all the
relevant $\beta_{s_i}(z,p)$ values, we see that P takes $O(2^{s_i} + 2^{s_{i+1}}) = O(S_i + S_{i+1})$ time to run the
entire sum-check protocol for iteration $i$ of our circuit-checking protocol.

Reducing to Verification of a Single Point. After executing the sum-check protocol at layer $i$ as
described above, V is left with a claim about $\tilde V_{i+1}(\omega_1)$ and $\tilde V_{i+1}(\omega_2)$ for two points
$\omega_1, \omega_2 \in \mathbb{F}^{s_{i+1}}$. If $i$ is a layer for which $\mathrm{in}_1^{(i)}$ and
$\mathrm{in}_2^{(i)}$ are similar (see Definition [4]), we run the reducing to verification of a
single point phase exactly as in the basic GKR protocol. This requires P to send $\tilde V_{i+1}(\ell(t))$ for a
canonical line $\ell(t)$ that passes through the points $\omega_1$ and $\omega_2$. Because $\mathrm{in}_1^{(i)}$ and
$\mathrm{in}_2^{(i)}$ are similar, it is easily seen that $\tilde V_{i+1}(\ell(t))$ is a univariate polynomial of
constant degree. Hence P can specify $\tilde V_{i+1}(\ell(t))$ by sending $\tilde V_{i+1}(\ell(t_j))$ for $O(1)$ many
points $t_j \in \mathbb{F}$. Using the method of Lemma [3], P can evaluate $\tilde V_{i+1}$ at each point $\ell(t_j)$
in $O(S_{i+1})$ time, and hence can perform all $\tilde V_{i+1}(\ell(t_j))$ evaluations in $O(S_{i+1})$ time in total.

Let $c = O(1)$ be the number of layers $i$ for which $\mathrm{in}_1^{(i)}$ and $\mathrm{in}_2^{(i)}$ are not similar.
At each such layer $i$, we skip the "reducing to verification at a single point" phase of the protocol. Each time we
do this, it doubles the number of points $\omega \in \mathbb{F}^{s_{i+1}}$ that must be considered at the next
iteration. However, we only skip the "reducing to verification at a single point" phase $c$ times, and thus at all
layers $i$ of the circuit, V needs to check $\tilde V_i(\omega_j)$ for at most $2^c = O(1)$ points. This affects P's
and V's runtime by at most a $2^c = O(1)$ factor, and the $O(S(n))$ time bound for P, and the
$O(n \log n + d(n) \log S(n))$ time bound for V follow. ■



B Analysis for Pattern Matching 



Let C be the circuit for pattern matching described in Section |53T Our goal in this appendix is to handle 



the layer of the circuit adjacent to the input layer. Call this layer $\ell$. Layer $\ell$ computes
$t_{i+k} - p_k$ for each pair $(i,k) \in [[n]] \times [[m]]$. We want to show how to use a sum-check protocol to
reduce a claim about the value of $\tilde V_\ell(z)$ for some $z \in \mathbb{F}^{s_\ell}$ to a claim about
$\tilde V_{\ell+1}(r)$ for some $r \in \mathbb{F}^{s_{\ell+1}}$, while ensuring that P runs in time
$O(S_\ell) = O(nm)$.

The idea underlying our analysis here is the following. The reason Theorem [1] does not apply to layer
$\ell$ is that the first in-neighbor of a gate with label
$p = (i_1, \dots, i_{\log n}, k_1, \dots, k_{\log m}) \in \{0,1\}^{\log n + \log m}$ has label
equal to the binary representation of the integer $i + k$, and a single bit $i_j$ can affect many bits in the binary
representation of $i + k$ (likewise, each bit in the binary representation of $i + k$ may be affected by many bits
in the binary representation of $i$ and $k$). In order to ensure that each bit of $p$ affects only a single bit of
$y = \mathrm{in}_1(p)$, we introduce $\log n$ dummy variables $(c_1, \dots, c_{\log n})$ and force the $j$th dummy
variable $c_j$ to have value equal to the $j$th carry bit when adding numbers $i$ and $k$ in binary. Now each bit of
$p$ affects only one output bit, and each output bit $y_j$ is only affected by at most three "input bits": $i_j$,
$k_j$, and $c_j$ if $j \le \log m$, and just $i_j$ and $c_j$ if $j > \log m$.
To this end, let $\phi : \{0,1\}^4 \to \{0,1\}$ be the function that evaluates to 1 on input $(i_1, k_1, c_0, c_1)$
if and only if $c_1 = 0$ and $i_1 + k_1 + c_0 < 2$, or $c_1 = 1$ and $i_1 + k_1 + c_0 \ge 2$. That is, $\phi$
outputs 1 if and only if $c_1$ is equal to the carry bit when adding $i_1$, $k_1$, and $c_0$. Let $\tilde\phi$ be the
multilinear extension of $\phi$. Notice $\tilde\phi$ can be evaluated at any point $r \in \mathbb{F}^4$ in $O(1)$ time.

Now let $(i,k,c)$ denote a vector in $\mathbb{F}^{\log n} \times \mathbb{F}^{\log m} \times \mathbb{F}^{\log n}$, and
define

$$\tilde\Phi(i,k,c) := \prod_{j=1}^{\log n} \tilde\phi(i_j, k_j, c_{j-1}, c_j),$$

where it is understood that $c_0 = 0$ and $k_j = 0$ for $j > \log m$.

For any Boolean vector $(i,k,c) \in \{0,1\}^{\log n} \times \{0,1\}^{\log m} \times \{0,1\}^{\log n}$, it is easily
verified that $\tilde\Phi(i,k,c) = 1$ if and only if for all $j$, $c_j$ equals the $j$th carry bit when adding
numbers $i$ and $k$ in binary.
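The carry-checking functions are small enough to test exhaustively on Boolean inputs. The sketch below assumes that bit $j = 1$ is the least significant bit (so that $c_0 = 0$ is the incoming carry of the lowest position); this convention and all function names are assumptions made for illustration.

```python
from itertools import product

def phi(i, k, c_prev, c):
    """1 iff c is the carry out of adding the bits i, k and the incoming carry c_prev."""
    return int(c == int(i + k + c_prev >= 2))

def big_phi(i_bits, k_bits, c_bits):
    """Product of phi over all bit positions; bit lists are least significant first."""
    k_pad = list(k_bits) + [0] * (len(i_bits) - len(k_bits))   # k_j = 0 for j > log m
    result, c_prev = 1, 0                                        # c_0 = 0
    for i, k, c in zip(i_bits, k_pad, c_bits):
        result *= phi(i, k, c_prev, c)
        c_prev = c
    return result

def true_carries(i_bits, k_bits):
    """The actual carry bits produced when adding i and k in binary."""
    k_pad = list(k_bits) + [0] * (len(i_bits) - len(k_bits))
    carries, c = [], 0
    for i, k in zip(i_bits, k_pad):
        c = int(i + k + c >= 2)
        carries.append(c)
    return carries
```

Enumerating all Boolean triples $(i, k, c)$ for small $\log n$ and $\log m$ confirms that `big_phi` equals 1 exactly on the vectors $c$ returned by `true_carries`.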

Finally, let $\gamma : \{0,1\}^3 \to \{0,1\}$ be the function that evaluates to 1 on input $(i_1, k_1, c_1)$ if and
only if $i_1 + k_1 + c_1 \equiv 1 \bmod 2$. Let $\tilde\gamma$ be the multilinear extension of $\gamma$. Notice
$\tilde\gamma$ can be evaluated at any point $r \in \mathbb{F}^3$ in $O(1)$ time.

Now consider the following $(\log n + \log m)$-variate polynomial over the field $\mathbb{F}$:

$$W^{(\ell)}(i,k) = \sum_{(c_1, \dots, c_{\log n}) \in \{0,1\}^{\log n}} \tilde\Phi(i,k,c) \cdot \Big( \tilde T\big(\tilde\gamma(i_1 + k_1 + c_0), \dots, \tilde\gamma(i_{\log n} + k_{\log n} + c_{\log n - 1})\big) - \tilde P(k_1, \dots, k_{\log m}) \Big),$$

where again it is understood that $c_0 = 0$ and $k_j = 0$ for $j > \log m$. Here, $\tilde T$ is the multilinear
extension of the input $T$, viewed as a function from $\{0,1\}^{\log n}$ to $[n]$, and $\tilde P$ is the multilinear
extension of the input pattern $P$, viewed as a function from $\{0,1\}^{\log m}$ to $[n]$.

It can be seen that for all Boolean vectors $(i,k) \in \{0,1\}^{\log n} \times \{0,1\}^{\log m}$,
$W^{(\ell)}(i,k) = V_\ell(i,k)$. This is because for any $(i,k) \in \{0,1\}^{\log n} \times \{0,1\}^{\log m}$,
$\tilde\Phi(i,k,c)$ will be zero for all $c$ except the $c$ consisting of the correct carry bits for $i$ and $k$, and
for this input $c$, $\tilde T\big(\tilde\gamma(i_1 + k_1 + c_0), \dots, \tilde\gamma(i_{\log n} + k_{\log n} + c_{\log n - 1})\big)$
will equal $T(i + k)$ when interpreting $i, k$ as integers in the natural way.
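The claim that the sum over carry vectors collapses to a single term can be checked by brute force on Boolean inputs. The sketch below assumes least-significant-first bit order and, as an artifact of working with exactly $\log n$ sum bits, wraps the index $i + k$ modulo $n$ when it overflows; neither convention is taken from the text, and the function names are hypothetical.

```python
from itertools import product

def phi(i, k, c_prev, c):
    """1 iff c is the carry out of adding the bits i, k and the incoming carry c_prev."""
    return int(c == int(i + k + c_prev >= 2))

def gamma(i, k, c_prev):
    """Sum bit: 1 iff i + k + c_prev is odd."""
    return (i + k + c_prev) % 2

def W_at_boolean_point(T, pattern, i_bits, k_bits):
    """Brute-force evaluation of the sum over all carry vectors c at a Boolean (i, k).
    Bit lists are least significant first; T has length 2**len(i_bits)."""
    logn = len(i_bits)
    k_pad = list(k_bits) + [0] * (logn - len(k_bits))        # k_j = 0 for j > log m
    k_index = sum(bit << j for j, bit in enumerate(k_bits))
    total = 0
    for c_bits in product((0, 1), repeat=logn):
        c_in = (0,) + c_bits[:-1]                            # incoming carries, c_0 = 0
        big_phi = 1
        for j in range(logn):
            big_phi *= phi(i_bits[j], k_pad[j], c_in[j], c_bits[j])
        y_bits = [gamma(i_bits[j], k_pad[j], c_in[j]) for j in range(logn)]
        y_index = sum(bit << j for j, bit in enumerate(y_bits))
        total += big_phi * (T[y_index] - pattern[k_index])
    return total
```

For every Boolean pair $(i, k)$, exactly one carry vector survives the factor `big_phi`, and the result equals `T[(i + k) % n] - pattern[k]`.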

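Lemma [4] then implies that for all $z \in \mathbb{F}^{\log n + \log m}$,

$$
\begin{aligned}
\tilde V_\ell(z) &= \sum_{(i,k) \in \{0,1\}^{\log n} \times \{0,1\}^{\log m}} \beta_{\log n + \log m}(z, (i,k)) \cdot W^{(\ell)}(i,k) \\
&= \sum_{(i,k,c) \in \{0,1\}^{\log n} \times \{0,1\}^{\log m} \times \{0,1\}^{\log n}} \beta_{\log n + \log m}(z, (i,k)) \cdot \tilde\Phi(i,k,c) \cdot \Big( \tilde T\big(\tilde\gamma(i_1 + k_1 + c_0), \dots, \tilde\gamma(i_{\log n} + k_{\log n} + c_{\log n - 1})\big) - \tilde P(k_1, \dots, k_{\log m}) \Big).
\end{aligned}
$$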

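Therefore, in order to reduce a claim about $\tilde V_\ell(z)$ to a claim about $\tilde T(r_1)$ and $\tilde P(r_2)$
for random vectors $r_1 \in \mathbb{F}^{\log n}$ and $r_2 \in \mathbb{F}^{\log m}$, it suffices to apply the sum-check
protocol to the $(2\log n + \log m)$-variate polynomial

$$g(i,k,c) = \beta_{\log n + \log m}(z, (i,k)) \cdot \tilde\Phi(i,k,c) \cdot \Big( \tilde T\big(\tilde\gamma(i_1 + k_1 + c_0), \dots, \tilde\gamma(i_{\log n} + k_{\log n} + c_{\log n - 1})\big) - \tilde P(k_1, \dots, k_{\log m}) \Big).$$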

It remains to show how to extend the techniques underlying Theorem [1] to allow P to compute all of the
required messages in this sum-check protocol in $O(nm)$ time. For brevity, we restrict ourselves to a sketch
of the techniques.

The first obvious complication is that the sum defining P's message in a given round of the sum-check
protocol has as many as $2^{2\log n + \log m} = \Omega(mn^2) \gg nm$ terms. Fortunately, the $\tilde\Phi$ polynomial
ensures that almost all of these terms are zero: when considering any Boolean setting of the variables $i_j$, $k_j$,
and $c_{j-1}$, the only setting of $c_j$ that P must consider is the one corresponding to the carry bit of
$i_j + k_j + c_{j-1}$, i.e. the unique setting of $c_j$ such that $\phi(i_j, k_j, c_{j-1}, c_j) = 1$. This ensures
that at rounds $3j$, $3j+1$, and $3j+2$ of the sum-check protocol applied to $g$, P must only evaluate $g$ at
$O(2^{\log n + \log m - j})$ terms, which is falling geometrically quickly with $j$.

We now turn to explaining how P can evaluate $g$ at all necessary points in rounds $3j$, $3j+1$ and $3j+2$ in
total time $O(2^{\log n + \log m - j})$. To accomplish this, it is sufficient for P to evaluate
$\beta_{\log n + \log m}$ at the necessary points, as well as $\tilde\Phi$, $\tilde T$, and $\tilde P$ at the necessary
points. The $\beta_{\log n + \log m}$ evaluations are handled exactly as in Theorem [1], i.e. by using $C^{(j)}$
arrays (but these arrays only get updated every time a variable $i_j$ or $k_j$ gets bound within the sum-check
protocol; no update is necessary when a variable $c_j$ gets bound). The $\tilde P$ evaluations are also handled
exactly as in Theorem [1], using $V^{(j)}$ arrays that only need to be updated when a variable $k_j$ gets bound.

The $\tilde T$ evaluations require some additional explanation on top of the analysis of Theorem [1]. We want P
to be able to use $V^{(j)}$ arrays as in Theorem [1] to evaluate $\tilde T$ at the necessary points in constant time
per point, but we need to make sure that P can compute array $V^{(j)}$ from $V^{(j-1)}$ in time that falls
geometrically quickly with $j$. In order to do this, it is essential to choose a specific ordering for the sum in
the sum-check protocol.

Specifically, we write the sum as:

$$\sum_{i_1} \sum_{k_1} \sum_{c_1} \sum_{i_2} \sum_{k_2} \sum_{c_2} \cdots \sum_{i_{\log n}} \sum_{c_{\log n}}.$$

This ensures that e.g. $(i_1, k_1, c_1)$ are the first three variables in the sum-check protocol to become bound
to random values in $\mathbb{F}$. The reason we must do this is so that every 3 rounds, another value
$\tilde\gamma(i_j + k_j + c_{j-1})$ feeding into $\tilde T$ becomes bound to a specific value (and moreover the
outputs of $\tilde\gamma(i_{j'} + k_{j'} + c_{j'-1})$ are unaffected by the bound variables for all $j' > j$). This is
precisely the property we exploited in the protocol of Theorem [1] to ensure that the $V^{(j)}$ arrays there halved
in size every round, and that $V^{(j)}$ could be computed from $V^{(j-1)}$ in time proportional to its size. So we
can use $V^{(j)}$ arrays to efficiently perform the $\tilde T$ evaluations, updating the arrays every time another
value $\tilde\gamma(i_j + k_j + c_{j-1})$ feeding into $\tilde T$ becomes bound to a specific value.

Finally, the $\tilde\Phi$ evaluations can be handled as follows. Write $\tilde\phi_{j'}$ for the $j'$th factor
$\tilde\phi(i_{j'}, k_{j'}, c_{j'-1}, c_{j'})$ of $\tilde\Phi$, and consider for simplicity round $3j$ of the protocol.
Recall that P only needs to evaluate $\tilde\Phi$ at points for which
$\tilde\phi(i_{j'}, k_{j'}, c_{j'-1}, c_{j'}) = 1$ for all $j' > j$. Thus, for all $j' > j$, $\tilde\phi_{j'}$ does not
affect the product defining $\tilde\Phi$. So in order to evaluate $\tilde\Phi$ at the relevant points, it suffices
for P to evaluate the $\tilde\phi_{j'}$'s for $j' \le j$. Now at round $3j$ of the protocol, all triples
$(i_{j'}, k_{j'}, c_{j'})$ for $j' < j$ are already bound, say to the values
$(r^{(i)}_{j'}, r^{(k)}_{j'}, r^{(c)}_{j'})$, and hence all the $\tilde\phi_{j'}$ functions for $j' < j$ are themselves
already bound to specific values. So in order to quickly determine the contribution of the $\tilde\phi_{j'}$'s for
$j' < j$ to the product defining $\tilde\Phi$, it suffices for P to maintain the quantity
$\prod_{j' < j} \tilde\phi_{j'}\big(r^{(i)}_{j'}, r^{(k)}_{j'}, r^{(c)}_{j'-1}, r^{(c)}_{j'}\big)$ over the course of
the protocol, which takes just $O(\log n)$ time in total. Finally, the contribution of $\tilde\phi_j$ to the product
defining $\tilde\Phi$ can be computed in constant time per point. This completes the proof that $\tilde\Phi$ can be
evaluated by P at all of the necessary points in $O(1)$ time per point over all rounds of the sum-check protocol,
and completes the proof of the theorem.






